ABSTRACT The recent ability to sequence whole genomes 
allows ready access to all genetic material. The approaches 
outlined here allow automated analysis of sequence for the 
synthesis of optimal primers in an automated multiplex 
oligonucleotide synthesizer (AMOS). The efficiency is such 
that all ORFs for an organism can be amplified by PCR. The 
resulting amplicons can be used directly in the construction of 
DNA arrays or can be cloned for a large variety of functional 
analyses. These tools allow a replacement of single-gene 
analysis with a highly efficient whole-genome analysis. 



The genome sequencing projects have generated and will 
continue to generate enormous amounts of sequence data. The 
genomes of Saccharomyces cerevisiae, Escherichia coli, 
Haemophilus influenzae (1), Mycoplasma genitalium (2), and 
Methanococcus jannaschii (3) have been completely sequenced. 
Other model organisms have had substantial portions of their 
genomes sequenced as well, including the nematode 
Caenorhabditis elegans (4) and the small flowering plant Arabidopsis 
thaliana (5). This massive and increasing amount of sequence 
information allows the development of novel experimental 
approaches to identify gene function. 

One standard use of genome sequence data is to attempt to 
identify the functions of predicted open reading frames 
(ORFs) within the genome by comparison to genes of known 
function. Such a comparative analysis of all ORFs to existing 
sequence data is fast, simple, and requires no experimentation 
and is therefore a reasonable first step. While finding sequence 
homologies/motifs is not a substitute for experimentation, 
noting the presence of sequence homology and/or sequence 
motifs can be a useful first step in finding interesting genes, in 
designing experiments and, in some cases, predicting function. 
However, this type of analysis is frequently uninformative. For 
example, over one-half of new ORFs in S. cerevisiae have no 
known function (6). If this is the case in a well-studied organism 
such as yeast, the problem will be even worse in organisms that 
are less well studied or less manipulable. A large, experimen- 
tally determined gene function database would make homol- 
ogy/motif searches much more useful. 

Experimental analysis must be performed to thoroughly 
understand the biological function of a gene product. Scaling 
up from classical "cottage industry" one-gene-oriented ap- 
proaches to whole-genome analysis would be very expensive 
and laborious. It is clear that novel strategies are necessary to 
efficiently pursue the next phase of the genome projects — 
whole-genome experimental analysis to explore gene expression, 
gene product function, and other genome functions. 
Model organisms, such as S. cerevisiae, will be extremely 
important in the development of novel whole-genome analysis 
techniques and, subsequently, in improving our understanding 
of other more complex and less manipulable organisms. 

The genome sequence can be systematically used as a tool 
to understand ORFs, gene product function, and other genome 
regions. Toward this end, a directed strategy has been 
developed for exploiting sequence information as a means of 
providing information about biological function (Fig. 1). 
Efforts have been directed toward the amplification of each 
predicted ORF or any other region of the genome ranging 
from a few base pairs to several kilobase pairs. There are many 
uses for these amplicons: they can be cloned into standard 
vectors or specialized expression vectors, or can be cloned into 
other specialized vectors, such as those used for two-hybrid 
analysis. The amplicons can also be used directly by, for 
example, arraying onto glass for expression analysis, for DNA 
binding assays, or for any direct DNA assay (7). As a pilot 
study, synthetic primers were made on the 96-well automated 
multiplex oligonucleotide synthesizer (AMOS) instrument (8) 
(Fig. 2). These oligonucleotides were used to amplify each 
ORF on yeast chromosome V. The current version of this 
instrument can synthesize three plates of 96 oligonucleotides 
each (25 bases) in an 8-hr day. The amplification of the entire 
set of PCR products was then analyzed by gel electrophoresis 
(Fig. 3). Successful amplification of the proper length product 
on the first attempt was 95%. This project demonstrates that 
one can go directly from sequence information to biological 
analysis in a truly automated, totally directed manner. 
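
As a rough sense of scale, the synthesis rate quoted above can be turned into a schedule estimate. The sketch below is illustrative only: it assumes two primers per ORF, roughly 6,000 ORFs for the whole yeast genome, and no failed syntheses; the function name is hypothetical and is not part of the AMOS software.

```python
import math


def synthesis_days(n_orfs, primers_per_orf=2, plates_per_day=3, wells_per_plate=96):
    """Estimate 8-hr working days needed to synthesize every primer.

    Assumes every synthesis succeeds on the first attempt (an idealization).
    """
    oligos_needed = n_orfs * primers_per_orf
    oligos_per_day = plates_per_day * wells_per_plate
    return math.ceil(oligos_needed / oligos_per_day)


# Pilot study: 240 ORFs from chromosome V.
print(synthesis_days(240))
# Whole genome, assuming about 6,000 ORFs.
print(synthesis_days(6000))
```

Under these assumptions the pilot set fits in 2 days on one instrument and the whole genome in about 42 days.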

These amplicons can be incorporated directly in arrays or 
the amplicons can be cloned. If the amplicons are to be cloned, 
novel sequences can be incorporated at the 5' end of the 
oligonucleotide to facilitate cloning. One potential problem 
with cloning PCR products is that the cloned amplicons may 
contain sequence alterations that diminish their utility. One 
option would be to resequence each individual amplicon. 
However, this is expensive, inefficient, and time consuming. A 
faster, more cost-effective, and more accurate approach is to 
apply comparative sequencing by denaturing HPLC (9). This 
method is capable of detecting a single base change in a 2-kb 
heteroduplex. Longer amplicons can be analyzed by use of 
appropriate restriction fragments. If any change is detected in 
a clone, an alternate clone of the same region can be analyzed. 
Modifying the system to allow high throughput analysis by 
denaturing HPLC is also relatively simple and straightforward. 

If amplicons are used directly on arrays without cloning, it 
is important to note that, even if single PCR product bands are 
observed on gels, the PCR products will be contaminated with 
various amounts of other sequences. This contamination has 
the potential to affect the results in, for example, expression 
analysis. On the other hand, direct use of the amplicons is 
much less labor intensive and greatly decreases the occurrence 
of mistakes in clone identification, a ubiquitous problem 
associated with large clone set archiving and retrieving. 

Fig. 1. Overview of systematic method for isolating individual 
genes. Sequence information is obtained automatically from sequence 
databases. The data are input into primer selection software 
specifically designed to target ORFs as designated by database annotations. 
The output file containing the primer information is directly read by 
a high-throughput oligonucleotide synthesizer, which makes the 
oligonucleotides in 96-well plates (AMOS, automated multiplex 
oligonucleotide synthesizer). The forward and reverse primers are 
synthesized in the same location on separate plates to facilitate the 
downstream handling of primers. The amplicons are generated by PCR in 
96-well plates as well. 

Any large-scale effort to capture each ORF within a genome 
must rely on automation if cost is to be minimized while 
efficiency is maximized. Toward that end, primers targeting 
ORFs were designed automatically using simple new scripts 
and existing primer selection software. These script-selected 
primer sequences were directly read by the high-throughput 
synthesizer and the forward and reverse primers were synthesized 
in separate plates in corresponding wells, to facilitate 
automated pipetting and PCR amplifications. Each of the 
resulting PCR products, generated with minimum labor, contains 
a known, unique ORF. 
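
The plate layout described above can be sketched in a few lines. This is a minimal illustration, not the scripts actually used: it simply takes the first 25 bases of each ORF as the forward primer and the reverse complement of the last 25 bases as the reverse primer, whereas real primer selection software also weighs melting temperature, secondary structure, and uniqueness. All function names are hypothetical.

```python
from string import ascii_uppercase

# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def design_primers(orf_seq, length=25):
    """Naive primer choice: the terminal `length` bases at each end of the ORF."""
    orf_seq = orf_seq.upper()
    if len(orf_seq) < length:
        raise ValueError("ORF is shorter than the requested primer length")
    forward = orf_seq[:length]
    reverse = reverse_complement(orf_seq[-length:])
    return forward, reverse


def well_name(index, n_cols=12):
    """Map 0..95 to well names A1..H12 on a 96-well plate, filling row by row."""
    row, col = divmod(index, n_cols)
    return f"{ascii_uppercase[row]}{col + 1}"


def plate_layout(orfs, length=25):
    """Assign each ORF one well; its forward and reverse primers occupy the
    same well position on two separate plates."""
    if len(orfs) > 96:
        raise ValueError("one plate holds at most 96 ORFs")
    layout = []
    for i, (name, seq) in enumerate(orfs):
        forward, reverse = design_primers(seq, length)
        layout.append({"well": well_name(i), "orf": name,
                       "forward_plate": forward, "reverse_plate": reverse})
    return layout
```

Because both primers for an ORF share a well position, a multichannel pipettor can combine the two plates into a PCR plate without any per-well bookkeeping.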

Large-scale genome analysis projects are dependent on 
newly emerging technologies to make the studies practical and 
economically feasible. For example, the cost of the primers, a 
significant issue in the past, has been reduced dramatically to 
make feasible this and other projects that require tens of 
thousands of oligonucleotides. Other methods of high- 
throughput analysis are also vital to the success of functional 
analysis projects, such as microarraying and oligonucleotide 
chip methods (10-14). 

Changes in attitude are also required. One of the major costs 
of commercial oligonucleotides is extensive quality control 
such that virtually 100% of the supplied oligonucleotides are 
successfully synthesized and work for their intended purpose. 
Fig. 2. Overall approach for using database of a genome to direct 
biological analysis. The synthesis of the 6,000 ORFs (orfs) for each 
gene of S. cerevisiae can be used in many applications utilizing both 
cloning and microarraying technology. 

Considerable cost reduction can be obtained by simply decreasing 
the expected successful synthesis rate to 95-97%. One 
can then achieve faster and cheaper whole genome coverage by 
simply adding a single quality control at the end of the 
experiment and batching the failures for resynthesis. 
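
The effect of batching failures can be estimated with simple arithmetic. The sketch below assumes that failures are independent and that resynthesized oligonucleotides succeed at the same rate as the first round; both are assumptions made for illustration, and the function name is hypothetical.

```python
def resynthesis_rounds(n_oligos, success_rate):
    """Expected number of oligos synthesized in each round when failures
    are batched into the next round, until fewer than one failure is expected."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    rounds = []
    pending = float(n_oligos)
    while pending >= 1:
        rounds.append(pending)
        pending *= 1 - success_rate
    return rounds


# 12,000 oligos (two primers for each of about 6,000 ORFs) at a 95% success rate.
rounds = resynthesis_rounds(12000, 0.95)
print(len(rounds), round(sum(rounds)))
```

With these inputs the expected batch sizes are roughly 12,000, 600, 30, and 1.5, so the total synthesis load rises by only about 5% over a single perfect round.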

The directed nature of the amplicon approach is of clear 
advantage. The sequence of each ORF is analyzed automatically, 
and unique specific primers are made to target each 
ORF. Thus, there is relatively little time or labor involved; for 
example, no random cloning and subsequent screening is 
required because each product is known. In the test system, 
primers for 240 ORFs from chromosome V were systematically 
synthesized, beginning from the left arm and continuing 
through to the right arm. At no point was there any manual 
analysis of sequence information to generate the collection. In 
many ways, now that the sequence is known, there is no need 
for the researcher to examine it. 

These amplicons can be arrayed and expression analysis can 
be done on all arrayed ORFs with a single hybridization (10). 
Those ORFs that display significant differential expression 
patterns under a given selection are easily identified without 
the laborious task of searching for and then sequencing a clone. 
Once scaled up, the procedure provides even greater returns 
on effort, because a single hybridization will ultimately provide 
a "snapshot" of the expression of all genes in the yeast genome. 
Thus, the limiting factor in whole genome analysis will not be 
the analysis process itself, but will instead be the ability of 
researchers to design and carry out experimental selections. 

Current expression and genetic analysis technologies are 
geared toward the analysis of single genes and are ill suited to 
analyze numerous genes under many conditions. Additional 
difficulties with current technologies include: the effort and 
expense required to analyze expression and make mutants, the 
potential duplication of effort if done by different laboratories, 
and the possibility of conflicting results obtained from differ- 
ent laboratories. In contrast, whole genome analysis not only 
is more efficient, it also provides data of much higher quality; 
all genes are assayed and compared in parallel under exactly 
the same conditions. In addition, amplicons have many applications 
beyond gene expression. For example, one recent 
approach is to incorporate a unique DNA sequence tag, 
synthesized as part of each gene-specific primer, during 
amplification. The tags or molecular bar codes, when reintroduced 
into the organism as a gene deletion or as a gene clone, 
can be used much more efficiently than individual mutations 
or clones because pools of tagged mutants or transformants 
can be analyzed in parallel. This parallel analysis is possible 
because the tags are readily and quantitatively amplified even 
in complex mixtures of tags (13). 

Fig. 3. Gel image of amplifications. Using the method described in Fig. 1, amplicons were generated for ORFs of S. cerevisiae chromosome 
V. One plate of 96 amplification reactions is shown. 

These ORF genome arrays and oligonucleotide tagged 
libraries can be used for many applications. Any conventional 
selection applied to a library that gives discrete or multiple 
products can use these technologies for a simple direct read- 
out. These include screens and selections for mutant comple- 
mentation, overexpression suppression (15, 16), second-site 
suppressors, synthetic lethality, drug target overexpression 
(17), two-hybrid screens (18), genome mismatch scanning (19), 
or recombination mapping. 

The genome projects have provided researchers with a vast 
amount of information. These data must be used efficiently 
and systematically to gain a truly comprehensive understand- 
ing of gene function and, more broadly, of the entire genome 
which can then be applied to other organisms. Such global 
approaches are essential if we are to gain an understanding of 
the living cell. This understanding should come from the 
viewpoint of the integration of complex regulatory networks, 
the individual roles and interactions of thousands of functional 
gene products, and the effect of environmental changes on 
both gene regulatory networks and the roles of all gene 
products. The time has come to switch from the analysis of a 
single gene to the analysis of the whole genome. 

Support was provided by National Institutes of Health Grants 
R37H60198 and P01H600205. 



1. Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., 
Kirkness, E. F., et al. (1995) Science 269, 496-512. 

2. Fraser, C. M., Gocayne, J. D., White, O., Adams, M. D., Clayton, 
R. A., et al. (1995) Science 270, 397-403. 

3. Bult, C. J., White, O., Olsen, G. J., Zhou, L., Fleischmann, R. D., 
et al. (1996) Science 273, 1058-1073. 

4. Sulston, J., Du, Z., Thomas, K., Wilson, R., Hillier, L., Staden, R., 
Halloran, N., Green, P., Thierry-Mieg, J., Qiu, L., Dear, S., 
Coulson, A., Craxton, M., Durbin, R., Berks, M., Metzstein, M., 
Hawkins, T., Ainscough, R. & Waterston, R. (1992) Nature 
(London) 356, 37-41. 

5. Newman, T., de Bruijn, F. J., Green, P., Keegstra, K., Kende, H., 
et al. (1994) Plant Physiol. 106, 1241-1255. 

6. Oliver, S. (1996) Nature (London) 379, 597-600. 

7. Lashkari, D. A. (1996) Ph.D. dissertation (Stanford Univ., 
Stanford, CA). 

8. Lashkari, D. A., Hunicke-Smith, S. P., Norgren, R. M., Davis, 
R. W. & Brennan, T. (1995) Proc. Natl. Acad. Sci. USA 92, 
7912-7915. 

9. Oefner, P. J. & Underhill, P. A. (1995) Am. J. Hum. Genet. 57, 
A266. 

10. Schena, M., Shalon, D., Davis, R. W. & Brown, P. O. (1995) 
Science 270, 467-470. 

11. Fodor, S. P., Read, J. L., Pirrung, M. C., Stryer, L., Lu, A. T. & 
Solas, D. (1991) Science 251, 767-773. 

12. Chee, M., Yang, R., Hubbell, E., Berno, A., Huang, X. C., Stern, 
D., Winkler, J., Lockhart, D. J., Morris, M. S. & Fodor, S. P. 
(1996) Science 274, 610-614. 

13. Shoemaker, D. D., Lashkari, D. A., Morris, D., Mittmann, M. & 
Davis, R. W. (1996) Nat. Genet. 14, 450-456. 

14. Smith, V., Chou, K., Lashkari, D., Botstein, D. & Brown, P. O. 
(1996) Science 274, 2069-2074. 

15. Magdolen, V., Drubin, D. G., Mages, G. & Bandlow, W. (1993) 
FEBS Lett. 316, 41-47. 

16. Ramer, S. W., Elledge, S. J. & Davis, R. W. (1992) Proc. Natl. 
Acad. Sci. USA 89, 11589-11593. 

17. Rine, J., Hansen, W., Hardeman, E. & Davis, R. W. (1983) Proc. 
Natl. Acad. Sci. USA 80, 6750-6754. 

18. Fields, S. & Song, O. (1989) Nature (London) 340, 245-246. 

19. Nelson, S. F., McCusker, J. H., Sander, M. A., Kee, Y., Modrich, 
P. & Brown, P. O. (1994) Nat. Genet. 4, 11-18. 






MOLECULAR CARCINOGENESIS 24:153-159 (1999) 




IN PERSPECTIVE 



Claudio J. Conti, Editor 

Microarrays and Toxicology: The Advent of 
Toxicogenomics 

Emile F. Nuwaysir, 1 Michael Bittner, 2 Jeffrey Trent, 2 J. Carl Barrett, 1 and Cynthia A. Afshari 1 

1 Laboratory of Molecular Carcinogenesis, National Institute of Environmental Health Sciences, Research Triangle Park, 
North Carolina 

2 Laboratory of Cancer Genetics, National Human Genome Research Institute, Bethesda, Maryland 

The availability of genome-scale DNA sequence information and reagents has radically altered life-science 
research. This revolution has led to the development of a new scientific subdiscipline derived from a combina- 
tion of the fields of toxicology and genomics. This subdiscipline, termed toxicogenomics, is concerned with the 
identification of potential human and environmental toxicants, and their putative mechanisms of action, through 
the use of genomics resources. One such resource is DNA microarrays or "chips," which allow the monitoring of 
the expression levels of thousands of genes simultaneously. Here we propose a general method by which gene 
expression, as measured by cDNA microarrays, can be used as a highly sensitive and informative marker for 
toxicity. Our purpose is to acquaint the reader with the development and current state of microarray technology 
and to present our view of the usefulness of microarrays to the field of toxicology. Mol. Carcinog. 24:153-159, 
1999. © 1999 Wiley-Liss, Inc. 

Key words: toxicology; gene expression; animal bioassay 



INTRODUCTION 

Technological advancements combined with in- 
tensive DNA sequencing efforts have generated an 
enormous database of sequence information over the 
past decade. To date, more than 3 million sequences, 
totaling over 2.2 billion bases [1], are contained 
within the GenBank database, which includes the 
complete sequences of 19 different organisms [2]. The 
first complete sequence of a free-living organism, 
Haemophilus influenzae, was reported in 1995 [3] and 
was followed shortly thereafter by the first complete 
sequence of a eukaryote, Saccharomyces cerevisiae [4]. 
The development of dramatically improved sequenc- 
ing methodologies promises that complete elucida- 
tion of the Homo sapiens DNA sequence is not far 
behind [5]. 

To exploit more fully the wealth of new sequence 
information, it was necessary to develop novel meth- 
ods for the high-throughput or parallel monitoring 
of gene expression. Established methods such as 
northern blotting, RNAse protection assays, S1 nuclease 
analysis, plaque hybridization, and slot blots 
do not provide sufficient throughput to effectively 
utilize the new genomics resources. Newer methods 
such as differential display [6], high-density filter 
hybridization [7,8], serial analysis of gene expression 
[9], and cDNA- and oligonucleotide-based microarray 
"chip" hybridization [10-12] are possible solutions 
to this bottleneck. It is our belief that the microarray 
approach, which allows the monitoring of expres- 
sion levels of thousands of genes simultaneously, is 
a tool of unprecedented power for use in toxicology 
studies. 



Almost without exception, gene expression is al- 
tered during toxicity, as either a direct or indirect 
result of toxicant exposure. The challenge facing 
toxicologists is to define, under a given set of ex- 
perimental conditions, the characteristic and spe- 
cific pattern of gene expression elicited by a given 
toxicant. Microarray technology offers an ideal plat- 
form for this type of analysis and could be the foun- 
dation for a fundamentally new approach to 
toxicology testing. 

MICROARRAY DEVELOPMENT AND APPLICATIONS 

cDNA Microarrays 

In the past several years, numerous systems were 
developed for the construction of large-scale DNA 
arrays. All of these platforms are based on cDNAs 
or oligonucleotides immobilized to a solid sup- 
port. In the cDNA approach, cDNA (or genomic) 
clones of interest are arrayed in a multi-well for- 
mat and amplified by polymerase chain reaction. 
The products of this amplification, which are usu- 
ally 500- to 2000-bp clones from the 3' regions of 
the genes of interest, are then spotted onto solid 
support by using high-speed robotics. By using 
this method, microarrays of up to 10 000 clones 
can be generated by spotting onto a glass substrate 
[13,14]. Sample detection for microarrays on glass 
involves the use of probes labeled with fluorescent 
or radioactive nucleotides. 

Fluorescent cDNA probes are generated from con- 
trol and test RNA samples in single-round reverse-tran- 
scription reactions in the presence of fluorescently 
tagged dUTP (e.g., Cy3-dUTP and Cy5-dUTP), which 
produces control and test products labeled with dif- 
ferent fluors. The cDNAs generated from these two 
populations, collectively termed the "probe," are then 
mixed and hybridized to the array under a glass coverslip 
[10,11,15]. The fluorescent signal is detected 
by using a custom-designed scanning confocal microscope 
equipped with a motorized stage and lasers 
for fluor excitation [10,11,15]. The data are analyzed 
with custom digital image analysis software that determines 
for each DNA feature the ratio of fluor 1 to 
fluor 2, corrected for local background [16,17]. The 
strength of this approach lies in the ability to label 
RNAs from control and treated samples with differ- 
ent fluorescent nucleotides, allowing for the simul- 
taneous hybridization and detection of both 
populations on one microarray. This method elimi- 
nates the need to control for hybridization between 
arrays. The research groups of Drs. Patrick Brown and 
Ron Davis at Stanford University spearheaded the 
effort to develop this approach, which has been suc- 
cessfully applied to studies of Arabidopsis thaliana 
RNA [10], yeast genomic DNA [15], tumorigenic ver- 
sus non-tumorigenic human tumor cell lines [11], 
human T-cells [18], yeast RNA [19], and human inflammatory 
disease-related genes [20]. The most dramatic 
result of this effort was the first published 
account of gene expression of an entire genome, that 
of the yeast Saccharomyces cerevisiae [21]. 
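
The background-corrected ratio described above can be illustrated with a short function. This is a sketch under stated assumptions rather than the cited image analysis software: the log2 transform and the `floor` parameter, which keeps a corrected signal from dropping to zero or below, are choices made here for illustration.

```python
import math


def log2_ratio(test_signal, test_background,
               control_signal, control_background, floor=1.0):
    """Log2 ratio of test to control intensity for one array feature,
    after subtracting the local background in each channel."""
    # Clamp corrected signals so the ratio and its logarithm stay defined.
    test = max(test_signal - test_background, floor)
    control = max(control_signal - control_background, floor)
    return math.log2(test / control)


# A feature twice as bright in the test channel after background correction.
print(log2_ratio(1100, 100, 600, 100))  # 1.0
```

On this scale, equal induction and suppression are symmetric about zero, which makes the two directions of change directly comparable.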

In an alternative approach, large numbers of cDNA 
clones can be spotted onto a membrane support, al- 
beit at a lower density [7,22]. This method is useful 
for expression profiling and large-scale screening and 
mapping of genomic or cDNA clones [7,22-24]. In 
expression profiling on filter membranes, two dif- 
ferent membranes are used simultaneously for con- 
trol and test RNA hybridizations, or a single 
membrane is stripped and reprobed. The signal is 
detected by using radioactive nucleotides and visu- 
alized by phosphorimager analysis or autoradiogra- 
phy. Numerous companies now sell such cDNA 
membranes and software to analyze the image data 
[25-27]. 

Oligonucleotide Microarrays 

Oligonucleotide microarrays are constructed either 
by spotting prefabricated oligos on a glass support 
[13] or by the more elegant method of direct in situ 
oligo synthesis on the glass surface by photolithog- 
raphy [28-30]. The strength of this approach lies in 
its ability to discriminate DNA molecules based on 
single base-pair difference. This allows the application 
of this method to the fields of medical diagnostics, 
pharmacogenetics, and sequencing by hybridization 
as well as gene-expression analysis. 

Fabrication of oligonucleotide chips by photoli- 
thography is theoretically simple but technically 
complex [29,30]. The light from a high-intensity 
mercury lamp is directed through a photolitho- 
graphic mask onto the silica surface, resulting in 
deprotection of the terminal nucleotides in the illu- 
minated regions. The entire chip is then reacted with 
the desired free nucleotide, resulting in selected chain 
elongation. This process requires only 4n cycles 
(where n = oligonucleotide length in bases) to syn- 
thesize a vast number of unique oligos, the total num- 
ber of which is limited only by the complexity of the 
photolithographic mask and the chip size [29,31,32]. 
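
The 4n relationship follows from needing one masked deprotection and coupling step for each of the four nucleotides at every position of the oligonucleotide. A small sketch of the arithmetic, with hypothetical function names, shows how slowly the cycle count grows compared with the number of possible sequences:

```python
def synthesis_cycles(oligo_length):
    """Cycles needed: one per nucleotide (A, C, G, T) at each position."""
    return 4 * oligo_length


def possible_sequences(oligo_length):
    """Number of distinct sequences of the given length."""
    return 4 ** oligo_length


for n in (8, 20, 25):
    print(n, synthesis_cycles(n), possible_sequences(n))
```

For example, all 65,536 possible 8-mers require only 32 cycles, so in practice the mask set and the chip area, not the chemistry, bound the number of features.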

Sample preparation involves the generation of 
double-stranded cDNA from cellular poly(A)+ RNA 
followed by antisense RNA synthesis in an in vitro 
transcription reaction with biotinylated or fluor- 
tagged nucleotides. The RNA probe is then frag- 
mented to facilitate hybridization. If the indirect 
visualization method is used, the chips are incubated 
with fluor-linked streptavidin (e.g., phycoerythrin) 
after hybridization [12,33]. The signal is detected with 
a custom confocal scanner [34]. This method has 
been applied successfully to the mapping of genomic 
library clones [35], to de novo sequencing by hybrid- 
ization [28,36], and to evolutionary sequence com- 
parison of the BRCA1 gene [37]. In addition, 
mutations in the cystic fibrosis [38] and BRCA1 [39] 
gene products and polymorphisms in the human immunodeficiency 
virus-1 clade B protease gene [40] 
have been detected by this method. Oligonucleotide 
chips are also useful for expression monitoring [33] 
as has been demonstrated by the simultaneous evalu- 
ation of gene-expression patterns in nearly all open 
reading frames of the yeast strain S. cerevisiae [12]. 
More recently, oligonucleotide chips have been used 
to help identify single nucleotide polymorphisms in 
the human [41] and yeast [42] genomes. 

THE USE OF MICROARRAYS IN TOXICOLOGY 

Screening for Mechanism of Action 

The field of toxicology uses numerous in vivo 
model systems, including the rat, mouse, and rab- 
bit, to assess potential toxicity and these bioassays 
are the mainstay of toxicology testing. However, in 
the past several decades, a plethora of in vitro tech- 
niques have been developed to measure toxicity, 
many of which measure toxicant-induced DNA dam- 
age. Examples of these assays include the Ames test, 
the Syrian hamster embryo cell transformation as- 
say, micronucleus assays, measurements of sister 
chromatid exchange and unscheduled DNA synthe- 
sis, and many others. Fundamental to all of these 
methods is the fact that toxicity is often preceded 
by, and results in, alterations in gene expression. In 
many cases, these changes in gene expression are a 
far more sensitive, characteristic, and measurable 
endpoint than the toxicity itself. We therefore pro- 
pose that a method based on measurements of the 
genome-wide gene expression pattern of an organ- 
ism after toxicant exposure is fundamentally infor- 
mative and complements the established methods 
described above. 

We are developing a method by which toxicants 
can be identified and their putative mechanisms of 
action determined by using toxicant-induced gene ex- 
pression profiles. In this method, in one or more de- 
fined model systems, dose and time-course parameters 
are established for a series of toxicants within a given 
prototypic class (e.g., polycyclic aromatic hydrocar- 
bons (PAHs)). Cells are then treated with these agents 
at a fixed toxicity level (as measured by cell survival), 
RNA is harvested, and toxicant-induced gene expres- 
sion changes are assessed by hybridization to a cDNA 
microarray chip (Figure 1). We have developed a custom 
DNA chip, called ToxChip v1.0, specifically for 
this purpose and will discuss it in more detail below. 
The changes in gene expression induced by the test 
agents in the model systems are analyzed, and the 
common set of changes unique to that class of toxi- 
cants, termed a toxicant signature, is determined. 

This signature is derived by ranking across all experiments 
the gene-expression data based on relative 
fold induction or suppression of genes in treated 
samples versus untreated controls and selecting the 
most consistently different signals across the sample 
set. A different signature may be established for each 
prototypic toxicant class. Once the signatures are de- 
termined, gene-expression profiles induced by un- 
known agents in these same model systems can then 
be compared with the established signatures. A match 
assigns a putative mechanism of action to the test 
compound. Figure 2 illustrates this signature method 
for different types of oxidant stressors, PAHs, and 
peroxisome proliferators. In this example, the un- 
known compound in question had a gene-expres- 
sion profile similar to that of the oxidant stressors in 
the database. We anticipate that this general method 
will also reveal cross talk between different pathways 
induced by a single agent (e.g., reveal that a compound 
has both PAH-like and oxidant-like properties). 
In the future, it may be necessary to distinguish 
very subtle differences between compounds within 
a very large sample set (e.g., thousands of highly simi- 
lar structural isomers in a combinatorial chemistry 
library or peptide library). To generate these highly 
refined signatures, standard statistical clustering tech- 
niques or principal-component analysis can be used. 
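
The signature method described above can be sketched as follows. This is a simplified illustration, not the authors' analysis pipeline: instead of ranking, it keeps genes whose fold change passes a fixed log2 threshold in the same direction in every profile of a class, and it scores an unknown by the fraction of signature genes that agree. The threshold value and all function names are assumptions.

```python
import math


def derive_signature(profiles, min_abs_log2=1.0):
    """profiles: list of dicts mapping gene -> fold change (treated/control)
    for several toxicants of one class. Returns gene -> +1 (induced) or
    -1 (suppressed) for genes that change consistently in every profile."""
    shared = set(profiles[0]).intersection(*profiles[1:])
    signature = {}
    for gene in shared:
        logs = [math.log2(p[gene]) for p in profiles]
        if all(v >= min_abs_log2 for v in logs):
            signature[gene] = 1
        elif all(v <= -min_abs_log2 for v in logs):
            signature[gene] = -1
    return signature


def match_score(profile, signature, min_abs_log2=1.0):
    """Fraction of signature genes changed in the expected direction."""
    if not signature:
        return 0.0
    hits = 0
    for gene, direction in signature.items():
        fold = profile.get(gene)
        if fold is not None and direction * math.log2(fold) >= min_abs_log2:
            hits += 1
    return hits / len(signature)


def classify(profile, signatures):
    """signatures: dict mapping class name -> signature.
    Returns the best-matching class name and all scores."""
    scores = {name: match_score(profile, sig) for name, sig in signatures.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

A compound that scores highly against two signatures at once would be a candidate for the kind of cross talk mentioned above; finer distinctions would call for the clustering or principal-component methods the text names.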

For the studies outlined in Figure 2, we developed 
the custom cDNA microarray chip ToxChip v1.0. 

(Figure 1 diagram labels: Control Population and Treated Population; RNA Isolation; Reverse Transcription; Mix cDNAs and Apply to Array; DNA "Chip"; Hybridize Under Coverslip.) 

Figure 1. Simplified overview of the method for sample 
preparation and hybridization to cDNA microarrays. For illustrative 
purposes, samples derived from cell culture are depicted, 
although other sample types are amenable to this analysis. 

Figure 2. Schematic representation of the method for identification 
of a toxicant's mechanism of action. In this method, 
gene-expression data derived from exposure of model systems 
to known toxicants are analyzed, and a set of changes 
characteristic to that type of toxicant (termed the toxicant 
signature) is identified. As depicted, oxidant stressors produce 
consistent changes in group A genes (indicated by red and 
green circles), but not group B or C genes (indicated by gray 
circles). The set of gene-expression changes elicited by the 
suspected toxicant is then compared with these characteristic 
patterns, and a putative mechanism of action is assigned to 
the unknown agent. 



The 2090 human genes that comprise this subarray 
were selected for their well-documented involve- 
ment in basic cellular processes as well as their re- 
sponses to different types of toxic insult. Included 
on this list are DNA replication and repair genes, 
apoptosis genes, and genes responsive to PAHs and 
dioxin-like compounds, peroxisome proliferators, 
estrogenic compounds, and oxidant stress. Some of 
the other categories of genes include transcription 
factors, oncogenes, tumor suppressor genes, cyclins, 
kinases, phosphatases, cell adhesion and motility 
genes, and homeobox genes. Also included in this 
group are 84 housekeeping genes, whose hybridiza- 
tion intensity is averaged and used for signal nor- 
malization of the other genes on the chip. To date, 
very few toxicants have been shown to have appre- 
ciable effects on the expression of these housekeep- 
ing genes. However, this housekeeping list will be 
revised if new data warrant the addition or deletion 
of a particular gene. Table 1 contains a general de- 
scription of some of the different classes of genes 
that comprise ToxChip v1.0. 
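
The housekeeping normalization described above can be sketched in a few lines. This is a minimal illustration that divides every intensity by the mean intensity of the housekeeping genes on the same chip; the function name and the handling of missing genes are assumptions, not details taken from the ToxChip software.

```python
from statistics import mean


def normalize_to_housekeeping(intensities, housekeeping_genes):
    """Scale all hybridization intensities by the mean intensity of the
    housekeeping genes measured on the same chip."""
    reference = [intensities[g] for g in housekeeping_genes if g in intensities]
    if not reference:
        raise ValueError("no housekeeping genes found on this chip")
    scale = mean(reference)
    if scale <= 0:
        raise ValueError("housekeeping mean must be positive")
    return {gene: value / scale for gene, value in intensities.items()}
```

Averaging over many housekeeping genes makes the scale factor robust to a change in any single one of them, which is why the list can be revised gene by gene as new data arrive.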

When a toxicant signature is determined, the 
genes within this signature are flagged within the 
database. When uncharacterized toxicants are then 
screened, the data can be quickly reformatted so that 
blocks of genes representing the different signatures 
are displayed [11]. This facilitates rapid, visual interpretation 
of data. We are also developing ToxChip 
v2.0 and chips for other model systems, 
including rat, mouse, Xenopus, and yeast, for use in 
toxicology studies. 

Animal Models in Toxicology Testing 

The toxicology community relies heavily on the 
use of animals as model systems for toxicology test- 
ing. Unfortunately, these assays are inherently ex- 
pensive, require large numbers of animals and take a 
long time to complete and analyze. Therefore, the 
National Institute of Environmental Health Sciences 
(NIEHS), the National Toxicology Program, and the 
toxicology community at large are committed to re- 
ducing the number of animals used, by developing 
more efficient and alternative testing methodologies. 
Although substantial progress has been made in the 
development of alternative methods, bioassays are 
still used for testing endpoints such as neurotoxic- 
ity, immunotoxicity, reproductive and developmen- 
tal toxicology, and genetic toxicology. The rodent 
cancer bioassay is a particularly expensive and time- 
consuming assay, as it requires almost 4 yr, 1200 
animals, and millions of dollars to execute and ana- 
lyze [43]. In vitro experiments of the type outlined 
in Figure 2 might provide evidence that an unknown 
Table 1. ToxChip v1.0: A Human cDNA Microarray
Chip Designed to Detect Responses to Toxic Insult

Gene category                            No. of genes on chip

Apoptosis                                  72
DNA replication and repair                 99
Oxidative stress/redox homeostasis         90
Peroxisome proliferator responsive         22
Dioxin/PAH responsive                      12
Estrogen responsive                        63
Housekeeping                               84
Oncogenes and tumor suppressor genes       76
Cell-cycle control                         51
Transcription factors                     131
Kinases                                   276
Phosphatases                               88
Heat-shock proteins                        23
Receptors                                 349
Cytochrome P450s                           30

*This list is intended as a general guide. The gene categories are not
unique, and some genes are listed in multiple categories.

agent is (or is not) responsible for eliciting a given 
biological response. This information would help to 
select a bioassay more specifically suited to the agent 
in question or perhaps suggest that a bioassay is not 
necessary, which would dramatically reduce cost,
animal use, and time. 

The addition of microarray techniques to stan- 
dard bioassays may dramatically enhance the sen- 
sitivity and interpretability of the bioassay and 
possibly reduce its cost. Gene-expression signatures 
could be determined for various types of tissue-spe- 
cific toxicants, and new compounds could be 
screened for these characteristic signatures, provid- 
ing a rapid and sensitive in vivo test. Also, because 
gene expression is often exquisitely sensitive to low 
doses of a toxicant, the combination of gene-expres- 
sion screening and the bioassay might allow the use 
of lower toxicant doses, which are more relevant to 
human exposure levels, and the use of fewer ani- 
mals. In addition, gene-expression changes are nor- 
mally measured in hours or days, not in the months 
to years required for tumor development. Further- 
more, microarrays might be particularly useful for 
investigating the relationship between acute and 
chronic toxicity and identifying secondary effects 
of a given toxicant by studying the relationship 
between the duration of exposure to a toxicant and 
the gene-expression profile produced. Thus, a bio- 
assay that incorporates gene-expression signatures 
with traditional endpoints might be substantially 
shorter, use more realistic dose regimens, and cost 
substantially less than the current assays do. 

These considerations are also relevant for branches 
of toxicology not related to human health and not 
using rodents as model systems, such as aquatic toxi- 
cology and plant pathology. Bioassays based on the 
fathead minnow, Daphnia, and Arabidopsis could



also be improved by the addition of microarray analy- 
sis. The combination of microarrays with traditional 
bioassays might also be useful for investigating some 
of the more intractable problems in toxicology re- 
search, such as the effects of complex mixtures and 
the difficulties in cross-species extrapolation. 

Exposure Assessment, Environmental Monitoring, 
and Drug Safety 

The currently used methods for assessment of ex- 
posure to chemical toxicants are based on measure- 
ment of tissue toxin levels or on surrogate markers 
of toxicity, termed biomarkers (e.g., peripheral blood 
levels of hepatic enzymes or DNA adducts). Because 
gene expression is a sensitive endpoint, gene expres- 
sion as measured with microarray technology may 
be useful as a new biomarker to more precisely iden- 
tify hazards and to assess exposure. Similarly, 
microarrays could be used in an environmental- 
monitoring capacity to measure the effect of poten- 
tial contaminants on the gene-expression profiles 
of resident organisms. In an analogous fashion, 
microarrays could be used to measure gene-expres- 
sion endpoints in subjects in clinical trials. The com- 
bination of these gene-expression data and more 
established toxic endpoints in these trials could be 
used to define highly precise surrogates of safety. 

Gene-expression profiles in samples from exposed 
individuals could be compared to the profiles of the 
same individuals before exposure. From this infor- 
mation, the nature of the toxic exposure can be de- 
termined or a relative clinical safety factor estimated. 
In the future it may also be possible to estimate not 
only the nature but the dose of the toxicant for a 
given exposure, based on relative gene-expression 
levels. This general approach may be particularly 
appropriate for occupational-health applications, in 
which unexposed and exposed samples from the 
same individuals may be obtainable. For example, 
a pilot study of gene expression in peripheral-blood 
lymphocytes of Polish coke-oven workers exposed 
to PAHs (and many other compounds) is under con- 
sideration at the NIEHS. An important consideration 
for these types of studies is that gene expression can 
be affected by numerous factors, including diet, 
health, and personal habits. To reduce the effects 
of these confounding factors, it may be necessary 
to compare pools of control samples with pools of 
treated samples. In the future it may be possible to 
compare exposed sample sets to a national database 
of human-expression data, thus eliminating the 
need to provide an unexposed sample from the same 
individual. Efforts to develop such a national gene- 
expression database are currently under way [44,45]. 
However, this national database approach will re- 
quire a better understanding of genome-wide gene 
expression across the highly diverse human popu- 
lation and of the effects of environmental factors 
on this expression. 
Alleles, Oligo Arrays, and Toxicogenetics

Gene sequences vary between individuals, and 
this variability can be a causative factor in human 
diseases of environmental origin [46,47]. A new area 
of toxicology, termed toxicogenetics, was recently 
developed to study the relationship between genetic 
variability and toxicant susceptibility. This field is 
not the subject of this discussion, but it is worth- 
while to note that the ability of oligonucleotide ar- 
rays to discriminate DNA molecules based on single 
base-pair differences makes these arrays uniquely 
useful for this type of analysis. Recent reports dem- 
onstrated the feasibility of this approach [41,42]. 
The NIEHS has initiated the Environmental Genome 
Project to identify common sequence polymor- 
phisms in 200 genes thought to be involved in en- 
vironmental diseases [48]. In a pilot study on the 
feasibility of this application to the Environmental 
Genome Project, oligonucleotide arrays will be used 
to resequence 20 candidate genes. This toxicogenetic 
approach promises to dramatically improve our un- 
derstanding of interindividual variability in disease 
susceptibility. 

FUTURE PRIORITIES 

There are many issues that must be addressed be- 
fore the full potential of microarrays in toxicology 
research can be realized. Among these are model sys- 
tem selection, dose selection, and the temporal na- 
ture of gene expression. In other words, in which 
species, at what dose, and at what time do we look 
for toxicant-induced gene expression? If human 
samples are analyzed, how variable is global gene 
expression between individuals, before and after toxi- 
cant exposure? What are the effects of age, diet, and 
other factors on this expression? Experience, in the 
form of large data sets of toxicant exposures, will 
answer these questions. 

One of the most pressing issues for array scientists 
is the construction of a national public database 
(linked to the existing public databases) to serve as a 
repository for gene-expression data. This relational 
database must be made available for public use, and 
researchers must be encouraged to submit their ex- 
pression data so that others may view and query the 
information. Researchers at the National Institutes 
of Health have made laudable progress in develop- 
ing the first generation of such a database [44,45]. In 
addition, improved statistical methods for gene clus- 
tering and pattern recognition are needed to ana- 
lyze the data in such a public database. 

The proliferation of different platforms and meth- 
ods for microarray hybridizations will improve 
sample handling and data collection and analysis and 
reduce costs. However, the variety of microarray 
methods available will create problems of data com- 
patibility between platforms. In addition, the near- 
infinite variety of experimental conditions under 



which data will be collected by different laborato- 
ries will make large-scale data analysis extremely dif- 
ficult. To help circumvent these future problems, a 
set of standards to be included on all platforms 
should be established. These standards would facili- 
tate data entry into the national database and serve 
as reference points for cross-platform and inter-labo- 
ratory data analysis. 

Many issues remain to be resolved, but it is clear 
that new molecular techniques such as microarray 
hybridization will have a dramatic impact on toxicol- 
ogy research. In the future, the information gathered 
from microarray-based hybridization experiments will 
form the basis for an improved method to assess the 
impact of chemicals on human and environmental 
health. 
Abstract

Recent progress in genomics and proteomics technologies has created a unique opportunity to significantly impact 
the pharmaceutical drug development processes. The perception that cells and whole organisms express specific 
inducible responses to stimuli such as drug treatment implies that unique expression patterns, molecular fingerprints 
indicative of a drug's efficacy and potential toxicity are accessible. The integration into state-of-the-art toxicology of
assays allowing one to profile treatment-related changes in gene expression patterns promises new insights into
mechanisms of drug action and toxicity. The benefits will be improved lead selection, and optimized monitoring of 
1. Introduction

The majority of drugs act by binding to protein 
targets, most to known proteins representing en- 
zymes, receptors and channels, resulting in effects 
such as enzyme inhibition and impairment of 
signal transduction. The treatment-induced per- 
turbations provoke feedback reactions aiming to 
compensate for the stimulus, which almost always 
are associated with signals to the nucleus, result- 
ing in altered gene expression. Such gene expres- 
sion regulations account for both the 
pharmacological action and the toxicity of a drug
and can be visualized by either global mRNA or 
global protein expression profiling. Hence, for 
each individual drug, a characteristic gene regula- 
tion pattern, its molecular fingerprint, exists 
which bears valuable information on its mode of 
action and its mechanism of toxicity. 

Gene expression is a multistep process that 
results in an active protein (Fig. 1). There exist 
numerous regulation systems that exert control at 
and after the transcription and the translation 
step. Genomics, by definition, encompasses the 
quantitative analysis of transcripts at the mRNA 
level, while the aim of proteomics is to quantify 
gene expression further down-stream, creating a 
snapshot of gene regulation closer to ultimate cell 
function control. 
2. Global mRNA profiling

Expression data at the mRNA level can be 
produced using a set of different technologies 
such as DNA microarrays, reverse transcript 
imaging, amplified fragment length polymorphism 
(AFLP), serial analysis of gene expression 
(SAGE) and others. Currently, DNA microarrays 
are very popular and hold great promise.
On a typical array, each gene of interest is repre-
sented either by a long DNA fragment (200-2400
bp) typically generated by polymerase chain reac- 
tion (PCR) and spotted on a suitable substrate 
using robotics (Schena et al., 1995; Shalon et al., 
1996) or by several short oligonucleotides (20-30 
bp) synthesized directly onto a solid support using 
photolabile nucleotide chemistry (Fodor et al.,
1991; Chee et al., 1996). From control and treated 
tissues, total RNA or mRNA is isolated and 
reverse transcribed in the presence of radioactive 
or fluorescent labeled nucleotides, and the labeled 
probes are then hybridized to the arrays. The 
intensity of the array signal is measured for each 
gene transcript by either autoradiography or laser 
scanning confocal microscopy. The ratio between 
the signals of control and treated samples reflects
the relative drug-induced change in transcript 
abundance. 
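The ratio calculation described above can be written out in a few lines. This is a minimal sketch under stated assumptions: the ratio is taken as treated over control and reported on a log2 scale so that induction and repression are symmetric, and a small intensity floor (a hypothetical parameter, not from the text) guards against division by zero for undetected transcripts.

```python
import math


def log2_ratio(treated, control, floor=1.0):
    """Log2 of the treated/control signal ratio for one gene.

    `floor` is an assumed minimum signal that keeps the ratio defined
    when a transcript is not detected in one of the samples.
    """
    return math.log2(max(treated, floor) / max(control, floor))


# Hypothetical array signals (arbitrary units).
induced = log2_ratio(treated=800.0, control=200.0)    # fourfold increase
repressed = log2_ratio(treated=100.0, control=400.0)  # fourfold decrease
```

On this scale a fourfold induction and a fourfold repression give values of equal magnitude and opposite sign, which is convenient when ranking genes by the size of the drug-induced change.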



3. Global protein profiling 

Global quantitative expression analysis at the 
protein level is currently restricted to the use of 
two-dimensional gel electrophoresis. This tech-
nique combines separation of tissue proteins by
isoelectric focusing in the first dimension and by 
sodium dodecyl sulfate slab gel electrophoresis- 
based molecular weight separation on the second, 
orthogonal dimension (Anderson et al., 1991).
The product is a rectangular pattern of protein 
spots that are typically revealed by Coomassie 
Blue, silver or fluorescent staining (Fig. 2).
Protein spots are identified by mass spectrometry 
following generation of peptide mass fingerprints 
(Mann et al., 1993) and sequence tags (Wilkins et
al., 1996). Similar to the mRNA approach, the
optical densities of spots from control and
treated samples are compared to
search for treatment-related changes.



4. Expression data analysis 

Bioinformatics forms a key element required to 
organize, analyze and store expression data from 
either source, the mRNA or the protein level. The 
overall objective, once a mass of high-quality 
quantitative expression data has been collected, is
to visualize complex patterns of gene expression
changes, to detect pathways and sets of genes
tightly correlated with treatment efficacy and toxi-
city, and to compare the effects of different sets of
treatment (Anderson et al., 1996). As the drug 
effect database is growing, one may detect similar- 
ities and differences between the molecular finger- 
prints produced by various drugs, information 
that may be crucial to make a decision whether to 
refocus or extend the therapeutic spectrum of a 
drug candidate. 
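One simple way to quantify the similarity between two molecular fingerprints is a correlation coefficient computed over the same set of genes. The sketch below uses the Pearson correlation of log ratios; this choice of measure is an assumption made for illustration and is not a method prescribed in the text. The drug names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical log2 ratios for the same four genes under three treatments.
drug_a = [1.0, 2.0, -1.0, 0.0]
drug_b = [2.0, 4.0, -2.0, 0.0]   # same pattern, stronger response
drug_c = [-1.0, -2.0, 1.0, 0.0]  # opposite pattern

similar = pearson(drug_a, drug_b)
opposite = pearson(drug_a, drug_c)
```

A coefficient near 1 suggests that two compounds regulate the same genes in the same direction, whereas a value near -1 indicates opposing effects.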



5. Comparison of global mRNA and protein 
expression profiling 

There are several synergies and overlaps of data 
obtained by mRNA and protein expression analy- 
sis. Low abundant transcripts may not be easily 
quantified at the protein level using standard two- 
dimensional gel electrophoresis analysis and their 
detection may require prefractionation of sam- 
ples. The expression of such genes may be prefer- 
ably quantified at the mRNA level using 
techniques allowing PCR-mediated target amplifi-
cation. Tissue biopsy samples typically yield good
quality of both mRNA and proteins; however, the 
quality of mRNA isolated from body fluids is 
often poor due to the faster degradation of 
mRNA when compared with proteins. RNA sam- 
ples from body fluids such as serum or urine are 
often not very 'meaningful', and secreted proteins
are likely more reliable surrogate markers for
treatment efficacy and safety. Detection of post-
translational modifications, events often related to
function or nonfunction of a protein, is restricted 
to protein expression analysis and rarely can be 
predicted by mRNA profiling. Information on 
subcellular localization and translocation of 
proteins has to be acquired at the level of the 
protein in combination with sample prefractiona- 
tion procedures. The growing evidence of a poor 
correlation between mRNA and protein abun- 
dance (Anderson and Seilhamer, 1997) further 
suggests that the two approaches, mRNA and 
protein profiling, are complementary and should 
be applied in parallel. 



6. Expression profiling and drug development 

Understanding the mechanisms of action and 
toxicity, and being able to monitor treatment 
efficacy and safety during trials is crucial for the 
successful development of a drug. Mechanistic 
insights are essential for the interpretation of drug 
effects and enhance the chances of recognizing 
potential species specificities contributing to an 
improved risk profile in humans (Richardson et 
al., 1993; Steiner et al., 1996b; Aicher et al., 1998).
The value of expression profiling further increases 
when links between treatment-induced expression 
profiles and specific pharmacological and toxic 
endpoints are established (Anderson et al., 1991,
1995, 1996; Steiner et al., 1996a). Changes in gene
expression are known to precede the manifesta- 
tion of morphological alterations, giving expres- 
sion profiling a great potential for early 
compound screening, enabling one to select drug 
candidates with wide therapeutic windows 
reflected by molecular fingerprints indicative of 
high pharmacological potency and low toxicity 
(Arce et al., 1998). In later phases of drug devel- 



opment, surrogate markers of treatment efficacy 
and toxicity can be applied to optimize the moni-
toring of pre-clinical and clinical studies (Doherty
et al., 1998).



7. Perspectives 

The basic methodology of safety evaluation has 
changed little during the past decades. Toxicity in 
laboratory animals has been evaluated primarily 
by using hematological, clinical chemistry and 
histological parameters as indicators of organ 
damage. The rapid progress in genomics and pro- 
teomics technologies creates a unique opportunity 
to dramatically improve the predictive power of 
safety assessment and to accelerate the drug devel- 
opment process. Application of gene and protein 
expression profiling promises to improve lead se- 
lection, resulting in the development of drug can- 
didates with higher efficacy and lower toxicity. 
The identification of biologically relevant surro- 
gate markers correlated with treatment efficacy 
and safety bears a great potential to optimize the 
monitoring of pre-clinical and clinical trials.
Decoding the genetic blueprint is a dream that
offers manifold returns in terms of understand- 
ing how organisms develop and function in an 
often hostile environment. With the rapid 
advances in molecular biology over the last 30 
years, the dream has come a step closer to reali- 
ty. Molecular biologists now have the ability to 
elucidate the composition of any genome. 
Indeed, almost 20 genomes have already been 
sequenced and more than 60 are currently 
under way. Foremost among these is the 
Human Genome Mapping Project. However, 
the genomes of a number of commonly used 
laboratory species are also under intensive 
investigation, including yeast, Arabidopsis, 
maize, rice, zebra fish, mouse, rat, and dog. It 
is widely expected that the completion of such 
programs will facilitate the development of 
many powerful new techniques and approach- 
es to diagnosing and treating genetically and 
environmentally induced diseases which afflict 
mankind. However, the vast amount of data 
being generated by genome mapping will 
require new high-throughput technologies to 
investigate the function of the millions of new 
genes that are being reported. Among the most 
widely heralded of the new functional 
genomics technologies are DNA arrays, which 
represent perhaps the most anticipated new 
molecular biology technique since polymerase 
chain reaction (PCR). 

Arrays enable the study of literally thou- 
sands of genes in a single experiment. The 
potential importance of arrays is enormous and 
has been highlighted by the recent publication 
of an entire Nature Genetics supplement dedi-
cated to the technology (1). Despite this huge
surge of interest, DNA arrays are still little used 
and largely unproven, as demonstrated by the 
high ratio of review and press articles to actual 
data papers. Even so, the potential they offer



has driven venture capitalists into a frenzy of 
investment and many new companies are 
springing up to claim a share of this rapidly 
developing market.

The U.S. Environmental Protection 
Agency (EPA) is interested in applying DNA 
array technology to ongoing toxicologic stud- 
ies. To learn more about the current state of 
the technology, the Reproductive Toxicology 
Division (RTD) of the National Health and 
Environmental Effects Research Laboratory 
(NHEERL; Research Triangle Park, NC) 
hosted a workshop on "Application of 
Microarrays to Toxicology" on 7-8 January 
1999 in Research Triangle Park, North 
Carolina. The workshop was organized by 
David Dix, Robert Kavlock, and John Rockett
of the RTD/NHEERL. Twenty-two intra- 
mural and extramural scientists from govern- 
ment, academia, and industry shared informa- 
tion, data, and opinions on the current and 
future applications for this exciting new tech- 
nology. The workshop had more than 150 
attendees, including researchers, students, and 
administrators from the EPA, the National 
Institute of Environmental Health Sciences 
(NIEHS), and a number of other establish- 
ments from Research Triangle Park and 
beyond. Presentations ranged from the tech- 
nology behind array production through the 
sharing of actual experimental data and projec- 
tions on the future importance and applica- 
tions of arrays. The information contained in 
the workshop presentations should provide aid 
and insight into arrays in general and their 
application to toxicology in particular. 

Array Elements 

In the context of molecular biology, the word 
"array" is normally used to refer to a series of 
DNA or protein elements firmly attached in 



a regular pattern to some kind of supportive
medium. DNA array is often used inter- 
changeably with gene array or microarray. 
Although not formally defined, microarray is 
generally used to describe the higher density 
arrays typically printed on glass chips. The 
DNA elements that make up DNA arrays 
can be oligonucleotides, partial gene 
sequences, or full-length cDNAs. Companies 
offering premade arrays that contain less
than full-length clones normally use regions 
of the genes which are specific to that gene to 
prevent false positives arising through cross- 
hybridization. Sequence verification of 
cDNA clone identity is necessary because of 
errors in identifying specific clones from 
cDNA libraries and databases. Premade 
DNA arrays printed on membranes are cur-
rently or imminently available for human,
mouse, and rat. In most cases they contain 
DNA sequences representing several thou- 
sand different sequence clusters or genes as 
delineated through the National Center for 
Biotechnology Information UniGene Project 
(i). Many of these different UniGene clusters
(putative genes) are represented only by 
expressed sequence tags (ESTs). 

Array Printing 

Arrays are typically printed on one of two 
types of support matrix. Nylon membranes 
are used by most off-the-shelf array providers 
such as Clontech Laboratories, Inc. 
(Palo Alto, CA), Genome Systems, Inc. (St. 
Louis, MO), and Research Genetics, Inc. 
(Huntsville, AL). Microarrays such as those 
produced by Affymetrix, Inc. (Santa Clara, 
CA), Incyte Pharmaceuticals, Inc. (Palo Alto, 
CA), and many do-it-yourself (DIY) arraying 
groups use glass wafers or slides. Although 
standard microscope slides may be used, they 
must be preprepared to facilitate sticking 
of the DNA to the glass. Several different 
coatings have been successfully used, includ-
ing silane and lysine. The coating of slides
can easily be carried out in the laboratory,
but many prefer the convenience of precoated 
slides available from suppliers. 

Once the support matrix has been pre-
pared, the DNA elements can be applied by
several methods. Affymetrix, Inc., has devel-
oped a unique photolithographic technology
for attaching oligonucleotides to glass wafers. 
More commonly, DNA is applied by either 
noncontact or contact printing. Noncontact
printers can use thermal, solenoid, or piezoelec- 
tric technology to spray aliquots of solution 
onto the support matrix and may be used to 
produce slide or membrane-based arrays. 
Cartesian Technologies, Inc. (Irvine, CA) has 
developed nQUAD technology for use in its 
PixSys printers. The system couples a syringe 
pump with the microsolenoid valve, a combi- 
nation that provides rapid quantitative dispens-
ing of nanoliter volumes (down to 42 nL) over
a variable volume range. A different approach 
to noncontact printing uses a solid pin and ring 
combination (Genetic MicroSystems, Inc.,
Woburn, MA). This system (Figure 1) allows a 
broader range of sample, including cell suspen- 
sions and particulates, because the printing 
head cannot be blocked up in the same way as 
a spray nozzle. Fluid transfer is controlled in 
this system primarily by the pin dimensions 
and the force of deposition, although the 
nature of the support matrix and the sample 
will also affect transfer to some degree. 

In contact printing, the pin head is dipped 
in the sample and then touched to the support 
matrix to deposit a small aliquot. Split pins
were one of the first contact-printing devices
to be reported and are the suggested format 
for DIY arrayers, as described by Brown (3). 
Split pins are small metal pins with a precise 
groove cut vertically in the middle of the pin 
tip. In this system, 1-48 split pins are posi-
tioned in the pin-head. The split pins work by
simple capillary action, not unlike a fountain 
pen: when the pin heads are dipped in the
sample, liquid is drawn into the pin groove. A 
small (fixed) volume is then deposited each 
time the split pins are gently touched to 
the support matrix. Sample (100-500 pL 
depending on a variety of parameters) can be 
deposited on multiple slides before refilling is 
required, and array densities of > 2,500 
spots/cm² may be produced. The deposit vol-
ume depends on the split size, sample fluidi-
ty, and the speed of printing. Split pins are
relatively simple to produce and can be made 
in-house if a suitable machine shop is avail- 
able. Alternatively, they can be obtained 
directly from companies such as TeleChem
International, Inc. (Sunnyvale, CA). 
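As a quick check on the density figure quoted above, the center-to-center spacing implied by a given spot density can be worked out directly. The sketch assumes that spots are laid out on a regular square grid, which is an assumption made here for illustration; the function name is hypothetical.

```python
import math

MICROMETERS_PER_CM = 10_000.0


def spot_pitch_um(spots_per_cm2):
    """Center-to-center spacing (micrometers) for a square grid of spots."""
    if spots_per_cm2 <= 0:
        raise ValueError("density must be positive")
    return MICROMETERS_PER_CM / math.sqrt(spots_per_cm2)


# 2,500 spots per square centimeter is a 50 x 50 grid in each 1 cm square.
pitch = spot_pitch_um(2500)
```

At 2,500 spots/cm² the spots therefore sit about 200 micrometers apart, so each spot and its surrounding gap must fit within that distance.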

Irrespective of their source, printers 
should be run through a preprint sequence 
prior to producing the actual experimental 



arrays; the first 100 or so spots of a new run 
tend to be somewhat variable. Factors affect-
ing spot reproducibility include slide treat-
ment homogeneity, sample differences, and
instrument errors. Other factors that come 
into play include clean ejection of the drop 
and clogging (nQUAD printing) and 
mechanical variations and long-term alter- 
ation in print-head surface of solid and split 
pins. However, with careful preparation it is 
possible to get a coefficient of variation for
spot reproducibility below 10%. 
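The reproducibility figure above can be checked on replicate spots with a short calculation. This sketch assumes the usual definition of the coefficient of variation (sample standard deviation divided by the mean, expressed as a percentage); the replicate values are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    if len(values) < 2:
        raise ValueError("at least two replicate spots are required")
    m = mean(values)
    if m == 0:
        raise ValueError("mean signal is zero; CV is undefined")
    return stdev(values) / m * 100.0


# Hypothetical signals from three replicate spots of the same sample.
replicates = [95.0, 100.0, 105.0]
cv_percent = coefficient_of_variation(replicates)
```

For these replicates the value is 5%, which would fall within the 10% limit mentioned above.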

One potential printing problem is sample 
carryover. Repeated washing, blotting, and 
drying (vacuum) of print pins between samples 
is normally effective at reducing sample carry- 
over to negligible amounts. Printing should 
also be carried out in a controlled environ- 
ment. Humidified chambers are available in 
which to place printers. These help prevent 
dust contamination and produce a uniform 
drying rate, which is important in determining 
spot size, quality, and reproducibility. 

In summary, although several printing 
technologies are available, none are par- 
ticularly outstanding and the bottom line 
is that they are still in a relatively early stage 
of evolution. 

Array Hybridization 

The hybridization protocol is, practically 
speaking, relatively straightforward and those 
with previous experience in blotting should 
have little difficulty. Array hybridizations 
are, in essence, reverse Southern/Northern 
blots — instead of applying a labeled probe to 
the target population of DNA/RNA, the 
labeled population is applied to the probe(s). 
With membrane-based arrays, the control and 
treated mRNA populations are normally con-
verted to cDNA and labeled with isotope (e.g.,
33P) in the process. These labeled populations
are then hybridized independently to parallel
or serial arrays and the hybridization signal is 
detected with a phosphorimager. A less com-
monly used alternative to radioactive probes is
enzymatic detection. The probe may be 
biotinylated, haptenylated, or have alkaline 
phosphatase/horseradish peroxidase attached.
Hybridization is detected by enzymatic reac-
tion yielding a color reaction (4). Differences
in hybridization signals can be detected by eye 
or, more accurately, with the help of digital 
imaging and commercially available software. 
The labeling of the test populations for slide- 
based microarrays uses a slightly different 
approach. The probe typically consists of two 
samples of polyA+ RNA (usually from a treated
and a control population) that are converted to 
cDNA; in the process each is labeled with a 
different fluor. The independently labeled 
probes are then mixed together and hybridized 
to a single microarray slide and the resulting 
combined fluorescent signal is scanned. After 
Figure 1. Genetic Microsystems (Woburn, MA) pin
ring system for printing arrays. The pin ring com- 
bination consists of a circular open ring oriented 
parallel to the sample solution, with a vertical pin 
centered over the ring. When the ring is dipped 
into a solution and lifted, it withdraws an aliquot 
of sample held by surface tension. To spot the 
sample, the pin is driven down through the ring 
and a portion of the solution is transferred to the 
bottom of the pin. The pin continues to move 
downward until the pendant drop of solution 
makes contact with the underlying surface. The 
pin is then lifted, and gravity and surface tension 
cause deposition of the spot onto the array. 
Figure from Rowers et al. (74), with permission 
from Genetic Microsystems. 

normalization, it is possible to determine the 
ratio of fluorescent signals from a single 
hybridization of a slide-based microarray. 
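The normalization and ratio step for a two-color hybridization can be sketched as follows. The text does not specify a normalization method, so this example assumes a simple global approach in which all ratios are rescaled so that the median ratio across the array equals 1. The function name, channel labels, and spot names are hypothetical.

```python
from statistics import median


def normalized_ratios(treated_channel, control_channel):
    """Per-spot treated/control ratios, rescaled so the median ratio is 1.

    Assumption: most genes on the array do not change, so the median raw
    ratio reflects labeling and detection bias rather than biology.
    """
    raw = {
        spot: treated_channel[spot] / control_channel[spot]
        for spot in treated_channel
        if spot in control_channel and control_channel[spot] > 0
    }
    if not raw:
        raise ValueError("no spots with usable signal in both channels")
    scale = median(raw.values())
    if scale <= 0:
        raise ValueError("median ratio must be positive")
    return {spot: ratio / scale for spot, ratio in raw.items()}


# Hypothetical fluorescence readings; the treated channel is twice as bright overall.
treated = {"G1": 200.0, "G2": 400.0, "G3": 800.0}
control = {"G1": 100.0, "G2": 200.0, "G3": 100.0}
ratios = normalized_ratios(treated, control)
```

Here the overall twofold brightness difference between the channels is removed, leaving G3 as the only spot with a ratio different from 1.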

cDNA derived from control and treated 
populations of RNA is most commonly 
hybridized to arrays, although subtractive 
hybridization or differential display reactions 
may also be used. Fluorophore- or radiola- 
beled nucleotides are directly incorporated 
into the cDNA in the process of converting 
RNA to cDNA. Alternatively, 5' end-labeled 
primers may be used for cDNA synthesis. 
These are labeled with a fluorophore for 
direct visualization of the hybridized array. 
Alternatively, biotin or a hapten may be 
attached to the primer, in which case fluor- 
labeled streptavidin or antibody must be 
applied before a signal can be generated. The 
most commonly used fluorophores at present
are cyanine (Cy)3 and Cy5 (Amersham 
Pharmacia Biotech AB, Uppsala, Sweden). 
However, the relative expense of these fluo- 
rescent conjugates has driven a search for 
cheaper alternatives. Fluorescein, rhodamine, 
and Texas red have all been used, and 
companies such as Molecular Probes, Inc. 
(Eugene, OR) are developing a series of 
labeled nucleotides with a wide range of exci- 
tation and emission spectra which may prove 
to function as well as the Cy dyes. 
Table 1. Advantages and disadvantages of different microarray scanning systems. 

CCD camera system 
  Advantages: few moving parts; fast scanning of bright samples 
  Disadvantages: less appropriate for dim samples; optical scatter can limit performance 

Nonconfocal laser scanner 
  Advantages: relatively simple optics 
  Disadvantages: low light collection efficiency; background artifacts not rejected; resolution typically low 

Confocal laser scanner 
  Advantages: small depth of focus reduces artifacts; may have high light collection efficiency 
  Disadvantages: small depth of focus requires scanning precision 


Analysis of DNA Microarrays 

Membrane-based arrays are normally analyzed 
on film or with a phosphorimager, whereas 
chip-based arrays require more specialized 
scanning devices. These can be divided into three 
main groups: the charge-coupled device camera 
systems, the nonconfocal laser scanners, and the 
confocal laser scanners. The advantages and 
disadvantages of each system are listed in Table 1. 

Because a typical spot on a microarray can 
contain > 10 s molecules, it is clear that a large 
variation in signal strength may occur. 
Current scanners cannot work across this 
many orders of magnitude (4 or 5 is more typ- 
ical). However, the scanning parameters can 
normally be adjusted to collect more or less 
signal, such that two or three scans of the same 
array should permit the detection of rare and 
abundant genes. 
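
As an illustration of how two scans of the same array might be combined, the following sketch keeps the high-sensitivity reading for each spot unless it is saturated, and otherwise substitutes the low-sensitivity reading rescaled by the median ratio between the scans. The 16-bit saturation value, the function name, and the intensities are assumptions made for the example, not details from the workshop.

```python
from statistics import median

SATURATION = 65535  # assumed ceiling of a 16-bit detector


def merge_scans(low_gain, high_gain, saturation=SATURATION):
    """Combine two scans of one array into a single extended-range intensity list."""
    # Estimate the gain ratio from spots that are usable in both scans.
    ratios = [h / l for l, h in zip(low_gain, high_gain)
              if h < saturation and l > 0]
    scale = median(ratios)
    # Keep the high-gain reading where it is valid; otherwise rescale
    # the low-gain reading onto the high-gain intensity scale.
    return [h if h < saturation else l * scale
            for l, h in zip(low_gain, high_gain)]


# Hypothetical spot intensities; the last spot saturates in the high-gain scan.
low = [10, 100, 1000, 20000]
high = [100, 1000, 10000, 65535]
merged = merge_scans(low, high)
```

The rare messages are read from the sensitive scan and the abundant ones from the less sensitive scan, so the merged list spans more orders of magnitude than either scan alone.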

When a microarray is scanned, the fluores- 
cent images are captured by software normally 
included with the scanner. Several commercial 
suppliers provide additional software for quan- 
tifying array images, but the software tools are 
constantly evolving to meet the developing 
needs of researchers, and it is prudent to 
define one's own needs and clarify the exact 
capabilities of the software before its purchase. 
Issues that should be considered include the 
following: 

* Can the software locate offset spots? 

* Can it quantitate across irregular hybridiza- 
tion signals? 

* Can the arrayed genes be programmed in for 
easy identification and location? 

* Can the software connect via the Internet to 
databases containing further information on 
the gene(s) of interest? 

One of the key issues raised at the work- 
shop was the sensitivity of microarray technol- 
ogy. Experiments by General Scanning, Inc. 
(Watertown, MA), have shown that by using 
the Cy dyes and their scanner, signal can be 
detected down to levels of < 1 fluor molecule 
per square micrometer, which translates to 
detecting a rare message at approximately one 
copy per cell or less. 

Array Applications 

Although arrays are an emerging technology 
certain to undergo improvement and 
alteration, they have already been applied use- 
fully to a number of model systems. Arrays are 
at their most powerful when they contain the 
entire genome of the species they are being 
used to study. For this reason, they have strong 
support among researchers utilizing yeast and 
Caenorhabditis elegans (5). The genomes of 
both of these species have been sequenced and, 
in the case of yeast, deposited onto arrays for 
examination of gene expression (6,7). With 
both of these species, it is relatively easy to 
perturb individual gene expression. Indeed, C. 
elegans knockouts can be made simply by 
soaking the worms in an antisense solution of 
the gene to be knocked out. 

Table 1 footnote: CCD, charge-coupled device. 
From Kawasaki (73). 

By a process of systematic gene disrup- 
tion, it is now possible to examine the cause 
and effect relationships between different 
genes in these simple organisms. This kind of 
approach should help elucidate biochemical 
pathways and genetic control processes, 
deconvolute polygenic interactions, and 
define the architecture of the cellular network. 
A simple case study of how this can be 
achieved was presented by Butow [University 
of Texas Southwestern Medical Center, 
Dallas, TX (Figure 2)]. Although it is the 
phenotypic result of a single gene knockout 
that is being examined, the effect of such 
perturbation will almost always be polygenic. 
Polygenic interactions will become increasing- 
ly important as researchers begin to move 
away from single gene systems when examin- 
ing the nature of toxicologic responses to 
external stimuli. This is especially important 
in toxicology because the phenotype pro- 
duced by a given environmental insult is 
never the result of the action of a single gene; 
rather, it is a complex interaction of one or 
multiple cellular pathways. Phenomena such 
as quantitative trait (the continuous variation 
of phenotype), epistasis (the effect of alleles of 
one or more genes on the expression of other 
genes), and penetrance (proportion of indi- 
viduals of a given genotype that display a par- 
ticular phenotype) will become increasingly 
evident and important as toxicologists push 
toward the ultimate goal of matching the 
responses of individuals to different 
environmental stimuli. 

Analysis of the transcriptome (the expres- 
sion level of all the genes in a given cell popula- 
tion) was a use of arrays addressed by several 
speakers. Unfortunately, current gene nomen- 
clature is often confusing in that single genes 
are allocated multiple names (usually as a result 
of independent discovery by different laborato- 
ries), and there was a call for standardization of 
gene nomenclature. Nevertheless, once a tran- 
scriptome has been assembled it can then be 
transferred onto arrays and used to screen any 
chosen system. The EPA MicroArray 
Consortium (EPAMAC) is assembling testes 



transcriptomes for human, rat, and mouse. In a 
slightly different approach, Nuwaysir et al. (8) 
describe how the NIEHS assembled what is 
effectively a "toxicological transcriptome" — a 
library of human and mouse genes that have 
previously been proven or implicated in 
responses to toxicologic insults. Clontech 
Laboratories, Inc. (Palo Alto, CA), has begun a 
similar process by developing stress/toxicology 
filter arrays of rat, mouse, and human genes. 
Thus, rather than being tissue or cell specific, 
these stress/toxicology arrays can be used across 
a variety of model systems to look for alter- 
ations in the expression of toxicologically 
important genes and define the new field of 
toxicogenomics. The potential to identify toxi- 
cant families based on tissue- or cell-specific 
gene expression could revolutionize drug test- 
ing. These molecular signatures or fingerprints 
could not only point to the possible 
toxicity/carcinogenicity of newly discovered 
compounds (Figure 3), but also aid in elucidat- 
ing their mechanism of action through identifi- 
cation of gene expression networks. By exten- 
sion, such signatures could provide easily iden- 
tifiable biomarkers to assess the degree, time, 
and nature of exposure. 
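
The idea of matching a test compound to a known toxicant signature can be sketched as follows. This is only an illustration: it assumes each signature is a vector of log expression ratios over the same ordered gene set, uses Pearson correlation as the similarity measure, and applies an arbitrary threshold of 0.8. The family names, values, and function names are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def best_match(profile, signatures, threshold=0.8):
    """Return (family, score) for the closest signature, or (None, score) if too weak."""
    scores = {name: pearson(profile, sig) for name, sig in signatures.items()}
    name = max(scores, key=scores.get)
    if scores[name] >= threshold:
        return name, scores[name]
    return None, scores[name]


# Hypothetical log-ratio signatures over the same five genes.
signatures = {
    "peroxisome proliferator": [2.0, 1.5, -1.0, 0.0, 3.0],
    "genotoxic agent": [-1.0, 0.0, 2.0, 1.0, -0.5],
}
compound_1 = [1.8, 1.4, -0.9, 0.1, 2.7]
compound_2 = [0.0, 2.0, 0.0, -2.0, 0.0]

match_1 = best_match(compound_1, signatures)
match_2 = best_match(compound_2, signatures)
```

Here the first compound closely tracks the peroxisome proliferator signature, whereas the second correlates only weakly with either family and is therefore left unclassified.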

DNA arrays are primarily a tool for exam- 
ining differential gene expression in a given 
model. In this context they are referred to as 
closed systems because they lack the ability of 
other differential expression technologies, e.g., 
differential display and subtractive hybridiza- 
tion, to detect previously unknown genes not 
present on the array. This would appear to 
limit the power of DNA arrays to the imagina- 
tions and preconceptions of the researcher in 
selecting genes previously characterized and 
thought to be involved in the model system. 
However, the various genome sequencing projects 
have created a new category of sequence, the 
EST, that has partially mollified this deficiency. 
ESTs are cDNAs expressed 
in a given tissue that, although they may share 
some degree of sequence similarity to previous- 
ly characterized genes, have not been assigned 
specific genetic identity. By incorporating EST 
clones into an array, it is possible to monitor 
the expression of these unknown genes. This 
can enable the identification of previously 
uncharacterized genes that may have biologic 
significance in the model system. Filter arrays 
from Research Genetics and slide arrays from 
Incyte Pharmaceuticals both incorporate large 
numbers of ESTs from a variety of species. 

A further use of microarrays is the identification 
of single nucleotide polymorphisms (SNPs). 
These genomic variations are abundant (they 
occur approximately every 1 kb or so) and are 
the basis of restriction fragment length 
polymorphism analysis used in forensic 
analysis. Affymetrix, Inc., designed chips that 
contain multiple repeats of the same gene 
sequence. Each position is present with all four 
possible bases. After the hybridization of the 
sample, the degree of hybridization to the dif- 
ferent sequences can be measured and the exact 
sequence of the target gene deduced. SNPs are 
thought to be of vital importance in drug 
metabolism and toxicology. For example, sin- 
gle base differences in the regulatory region or 
active site of some genes can account for huge 
differences in the activity of that gene. Such 
SNPs are thought to explain why some people 
are able to metabolize certain xenobiotics bet- 
ter than others. Thus, arrays provide a further 
tool for the toxicologist investigating the 
nature of susceptible subpopulations and toxi- 
cologic response. 
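
The four-probe design described above (each sequence position represented with all four possible bases) lends itself to a simple base-calling sketch. The version below is a hedged illustration, not the Affymetrix algorithm: it calls the base whose probe gives the strongest signal and reports "N" when the strongest probe does not exceed the runner-up by an assumed factor of 1.5. All names and intensities are hypothetical.

```python
def call_sequence(probe_intensities, min_ratio=1.5):
    """Deduce a sequence from per-position hybridization signals.

    probe_intensities: one dict per position mapping base -> signal.
    Positions without a clear winner are reported as "N".
    """
    calls = []
    for signals in probe_intensities:
        ranked = sorted(signals, key=signals.get, reverse=True)
        best, second = ranked[0], ranked[1]
        # Require the top probe to beat the runner-up by min_ratio.
        clear = signals[best] > 0 and (
            signals[second] == 0
            or signals[best] / signals[second] >= min_ratio
        )
        calls.append(best if clear else "N")
    return "".join(calls)


# Hypothetical signals for four positions; the third is ambiguous (A versus T).
positions = [
    {"A": 900, "C": 50, "G": 40, "T": 60},
    {"A": 30, "C": 45, "G": 800, "T": 20},
    {"A": 400, "C": 20, "G": 30, "T": 380},
    {"A": 10, "C": 700, "G": 25, "T": 15},
]
called = call_sequence(positions)
```

An ambiguous position such as the third one could reflect a heterozygous sample, which is the kind of site a SNP screen would flag for follow-up.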

There are still many wrinkles to be ironed 
out before arrays become a standard tool for 
toxicologists. The main issues raised at the 
workshop by those with hands-on experience 
were the following: 

• Expense: the cost of purchasing/contracting 
this technology is still too great for many 
individual laboratories. 




Figure 2. Potential effects of gene knockout within 
positively and negatively regulated gene expression 
networks. *, is limiting in wild type for expression of 
^. (A) A simple, two-component, linear regulatory 
network operating on gene ^ where /, is a positive 
effector of ^ and j n is either a positive or negative 
effector of This network could be deduced by 
examining the consequence of (B) deleting j n on the 
expression of /, and ^ where the expression of ^ 
would be decreased or increased depending on 
whether j n was a positive or negative regulator. 
These and other connected components of even 
greater complexity could be revealed by genome- 
wide expression analysis. From Butow (75). 



• Clones: the logistics of identifying, obtaining, 
and maintaining a set of nonredundant, 
noncontaminated, sequence-verified, 
species/cell/tissue/field-specific clones. 

• Use of inbred strains: where whole-organism 
models are being used, the use of inbred 
strains is important to reduce the potentially 
confusing effects of the individual variation 
typically seen in outbred populations. 

• Probe: the need for relatively large amounts 
of RNA, which limits the type of sample 
(e.g., biopsy) that can be used. Also, different 
RNA extraction methods can give different 
results. 

• Specificity: the ability to discriminate 
accurately between closely related genes (e.g., 
the cytochrome P450 family) and splice variants. 

• Quantitation: the quantitation of gene 
expression using gene arrays is still open to 
debate. One reason for this is the different 
incorporation of the labeling dyes. However, 
the main difficulty lies in knowing what to 
normalize against. One option is to include a 
large number of so-called housekeeping genes 
in the array. However, the expression of these 
genes often changes depending on the tissue 
and the toxicant, so it is necessary to 
characterize the expression of these genes in the 
model system before utilizing them. This is 
clearly not a viable option when screening 
multiple new compounds. A second option 
is to include on the array genes from a 
nonrelated species (e.g., a plant gene on an animal 
array) and to spike the probe with synthetic 
RNA(s) complementary to the gene(s). 

• Reproducibility: this is sometimes 
questionable, and a figure of approximately two or 
three repeats was used as the minimum number 
required to confirm initial findings. 
Again, however, most people advocated the 
use of Northern blots or reverse transcriptase 
PCR to confirm findings. 

* Sensitivity: concerns were voiced about the 
number of target molecules that must be pre- 
sent in a sample for them to be detected on 
the array. 

* Efficiency: reproducible identification of 1.5- 
to 2-fold differences in expression was report- 
ed, although the number of genes that 
undergo this level of change and remain 
undetected is open to debate. It is important 
that this level of detection be ultimately 
achieved because it is commonly perceived 
that some important transcription factors 
and their regulators respond at such low 
levels. In most cases, 3- to 5-fold was the 
minimum change that most were happy to 
accept. 

* Bioinformatics: perhaps the greatest concern 
was how to interpret the data with 
the greatest accuracy and efficiency. The 
biggest headache is trying to identify net- 
works of gene expression that are common to 
different treatments or doses. The amount of 
data from a single experiment is huge. It may 
be that, in the future, several groups individ- 
ually equipped with specialized software algo- 
rithms for studying their favorite genes or 
gene systems will be able to share the same 
hybridized chips. Thus, arrays could usher in 
a new perspective on collaboration and the 
sharing of data. 
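
The second normalization option raised under Quantitation in the list above (spiking the probe with synthetic RNA complementary to genes from a nonrelated species) can be sketched as follows. This is a minimal illustration under stated assumptions: equal amounts of spike RNA are added to both samples, signals are strictly positive, and the gene identifiers and function names are hypothetical.

```python
from statistics import mean


def spike_in_factor(control, treated, spike_ids):
    """Scale factor that equalizes the mean spike-in signal of the two samples."""
    control_mean = mean(control[g] for g in spike_ids)
    treated_mean = mean(treated[g] for g in spike_ids)
    return control_mean / treated_mean


def fold_changes(control, treated, spike_ids):
    """Spike-normalized treated/control fold change for every non-spike gene."""
    k = spike_in_factor(control, treated, spike_ids)
    return {g: treated[g] * k / control[g]
            for g in control if g not in spike_ids}


# Hypothetical signals; the treated sample labeled twice as efficiently overall.
control = {"plant_spike_1": 1000, "plant_spike_2": 2000,
           "gene_a": 500, "gene_b": 800}
treated = {"plant_spike_1": 2000, "plant_spike_2": 4000,
           "gene_a": 1000, "gene_b": 4800}
spikes = {"plant_spike_1", "plant_spike_2"}

changes = fold_changes(control, treated, spikes)
```

Because the spike-in signals double between samples, the apparent twofold rise in gene_a is attributed to labeling efficiency, while gene_b retains a genuine threefold change.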

EPAMAC 

Figure 3. Gene expression profiles (also called fingerprints or signatures) of known toxicants or 
toxicant families may, in the future, be used to identify the potential toxicity of new drugs, etc. In this 
example, the genetic signature of test compound 1 is identical to that of known peroxisome proliferators, 
whereas that of test compound 2 does not match any known toxicant family. Based on these results, test 
compound 2 would be retained for further testing and test compound 1 would be eliminated. 

Perhaps the main reason most scientists are 
unable to use array technology is the high cost 
involved, whether buying off-the-shelf mem- 
branes, using contract printing services, or 
producing chips in-house. In view of this, 
researchers at the RTD/NHEERL initiated 
the EPAMAC. This consortium brings 
together scientists from the EPA and a num- 
ber of extramural labs with the aim of devel- 
oping microarray capability through the shar- 
ing of resources and data. EPAMAC 
researchers are primarily interested in the 
developmental and toxicologic changes seen 
in testicular and breast tissue, and a portion 
of the workshop was set aside for EPAMAC 
members to share their ideas on how the 
experimental application of microarrays could 
facilitate their research. One of the central 
areas of interest to EPAMAC members is the 
effect of xenobiotics on male fertility and 
reproductive health. Of greatest concern is 
the effect of exposure during critical periods 
of development and germ cell differentiation 
(9), and how this may compromise sperm 
counts and quality following sexual matura- 
tion (10). As well as spermatogenic tissue, 
there is also interest in how residual mRNA 
found in mature sperm (11) could be used as 
an indicator of previous xenobiotic effects (it 
is easier to obtain a semen sample than a tes- 
ticular biopsy). Arrays will be used to examine 
and compare the effect of exposure to heat 
and chemicals in testicular and epididymal 
gene expression profiles, with the aim of 
establishing relationships/associations 
between changes in developmental landmarks 
and the effects on sperm count and quality. 
Cluster, pattern, and other analysis of such 
data should help identify hidden relationships 
between genes that may reveal potential 
mechanisms of action and uncover roles for 
genes with unknown functions. 

Summary 

The full impact of DNA arrays may not be 
seen for several years, but the interest shown at 
this regional workshop indicates the high level 
of interest that they foster. Apart from educating 
attendees about the various technologies in this 
field and advertising them, this workshop brought together a 
number of researchers from the Research 
Triangle Park area who are already using DNA 
arrays. The interest in sharing ideas and experi- 
ences led to the initiation of a Triangle array 
user's group. 
Array technology is still in its infancy. This 
means that the hardware is still improving and 
there is no current consensus for standard 
procedures, quantitation, and interpretation. 
Consistency in spotting and scanning arrays is 
not yet optimized, and this is one of the most 
critical requirements of any experiment. In 
addition, one of the dark regions of array 
technology (strife in the courts over who owns 
what portions of it) has further muddled the 
future and is a potential barrier toward the 
development of consensus procedures. 

Perhaps the greatest hurdle for the application 
of arrays is the actual interpretation of 
data. No specialists in bioinformatics attended 
the workshop, largely because they are rare and 
because as yet no one seems clear on the best 
method of approaching data analysis and 
interpretation. Cross-referencing results from 
multiple experiments (time, dose, repeats, different 
animals, different species) to identify commonly 
expressed genes is a great challenge. In most 
cases, we are still a long way from understanding 
how the expression of gene X is related to 
the expression of gene Y, and ordering gene 
expression to delineate causal relationships. 

To the ordinary scientist in the typical 
laboratory, however, the most immediate problem 
is a lack of affordable instrumentation. 
One can purchase premade membranes at 
relatively affordable prices. Although these 
may be useful in identifying individual genes 
to pursue in more detail using other methods, 
the numbers that would be required for even a 
small routine toxicology experiment prohibit 
this as a truly viable approach. For the 
toxicologist, there is a need to carry out multiple 
experiments: dose responses, time curves, 
multiple animals, and repeats. Glass-based 
DNA arrays are most attractive in this context 
because they can be prepared in large batches 
from the same DNA source and accommodate 
control and treated samples on the same 
chip. Another problem with current 
off-the-shelf arrays is that they often do not contain 
one or more of the particular genes a group is 
interested in. One alternative is to obtain 
or produce a set of custom clones and 
have contract printing of membranes or slides 
carried out by a company such as Genomic 
Solutions, Inc. (Ann Arbor, MI). This approach 
is less expensive than laying out capital for 
one's own entire system, although at some 
point it might make economic sense to print 
one's own arrays. 

Finally, DNA arrays are currently a team 
effort. They are a technology that uses a wide 
range of skills including engineering, statistics, 
molecular biology, chemistry, and 
bioinformatics. Because most individuals are skilled in 
only one or perhaps two of these areas, it 
appears that success with arrays may be best 
expected by teams of collaborators consisting 
of individuals having each of these skills. 

Those considering array applications may 
be amused or goaded on by the following 
quote from Fortune magazine (12): 

Microprocessors have reshaped our economy, 
spawned vast fortunes and changed the way we live. 
Gene chips could be even bigger. 

Although this comment may have been 
designed to excite the imagination rather than 
accurately reflect the truth, it is fair to say that 
the age of functional genomics is upon us. 
DNA arrays look set to be an important tool in 
this new age of biotechnology and will likely 
contribute answers to some of toxicology's 
most fundamental questions. 
Docket No.: PF-0594USN 

USSN: 09/786,797 

Subject: RE: [Fwd: Toxicology Chip] 
Date: Mon, 3 Jul 2000 08:09:45 -0400 
From: "Afshari.Cynthia" <afshari@niehs.nih.gov> 
To: "Diana Hamlet-Cox" <dianahc@incyte.com> 

You can see the list of clones that we have on our 12K chip at 
ht t? : mar.ue 1 . r.iehs . *;h . ccv raps • cues: ' clcnesrch . cf r. 

We selected a subset of genes ( 2000K ) that we believed critical to tox 
response and basic cellular processes and added a set of clones and ESTs to 
this. We have included a set of control genes (80+) that were selected by 
the NHGRI because they did not change across a large set of array 
experiments. However, we have found that some of these genes change 
significantly after tox treatments and are in the process of looking at the 
variation of each of these 80+ genes across our experiments. 
Our chips are constantly changing and being updated and we hope that our 
data will lead us to what the toxchip should really be. 
I hope this answers your question. 
Cindy Afshari 



> From: Diana Hamlet-Cox 
> Sent: Monday, June 26, 2000 8:52 PM 
> To: afshari@niehs.nih.gov 
> Subject: [Fwd: Toxicology Chip] 
> 
> Dear Dr. Afshari, 
> 
> Since I have not yet had a response from Bill Grigg, perhaps he was not 
> the right person to contact. 
> 
> Can you help me in this matter? I don't need to know the sequences 
> necessarily, but I would like very much to know what types of sequences 
> are being used, e.g., GPCRs (more specific?), ion channels, etc. 
> 
> Diana Hamlet-Cox 
> 
> Original Message 
> Subject: Toxicology Chip 
> Date: Mon, 19 Jun 2000 18:31:48 -0700 
> From: Diana Hamlet-Cox <dianahc@incyte.com> 
> Organization: Incyte Pharmaceuticals 
> To: grigg@niehs.nih.gov 
> 
> Dear Colleague: 
> 
> I am doing literature research on the use of expressed genes as 
> pharmacotoxicology markers, and found the Press Release dated February 
> 29, 2000 regarding the work of the NIEHS in this area. I would like to 
> know if there is a resource I can access (or you could provide?) that 
> would give me a list of the 12,000 genes that are on your Human ToxChip 
> Microarray. In particular, I am interested in the criteria used to 
> select sequences for the ToxChip, including any control sequences 
> included in the microarray. 
> 
> Thank you for your assistance in this request. 
> 
> Diana Hamlet-Cox, Ph.D. 
> Incyte Genomics, Inc. 



Proteomics: a major new 
technology for the drug 
discovery process 

Martin J. Page, Bob Amess, Christian Rohlff, Colin Stubberfield 
and Raj Parekh 



Proteomics is a new enabling technology that is being 
integrated into the drug discovery process. This will 
facilitate the systematic analysis of proteins across any 
biological system or disease, forwarding new targets 
and information on mode of action, toxicology and sur- 
rogate markers. Proteomics is highly complementary to 
genomic approaches in the drug discovery process and, 
for the first time, offers scientists the ability to integrate 
information from the genome, expressed mRNAs, their 
respective proteins and subcellular localization. It is ex- 
pected that this will lead to important new insights into 
disease mechanisms and improved drug discovery 
strategies to produce novel therapeutics. 

Among the major pharmaceutical and biotechnol- 
ogy companies, it is clearly recognized that the 
business of modern drug discovery is a highly 
competitive process. All of the many steps in- 
volved are inherently complex, and each can involve a 
high risk of attrition. The players in this business strive 
continuously to optimize and streamline the process; each 
seeking to gain an advantage at every step by attempting 
to make informed decisions at the earliest stage possible. 
The desired outcome is to accelerate as many key activities 
in the drug discovery process as possible. This should pro- 



duce a new generation of robust drugs that offer a high 
probability of success and reach the clinic and market 
ahead of the competition. 

There has been noticeable emphasis over recent years 
for companies to aggressively review and refine their 
strategies to discover new drugs. Central to this has been 
the introduction and implementation of cutting-edge 
technologies. Most, if not all, companies have now inte- 
grated key technology platforms that incorporate gen- 
omics, mRNA expression analysis, relational databases, 
high-throughput robotics, combinatorial chemistry and 
powerful bioinformatics. Although it is still early days to 
quantify the real impact of these platforms in clinical and 
commercial terms, expectations are high, and it is widely 
accepted that significant benefits will be forthcoming. This 
is largely based on data obtained during preclinical studies 
where the genomic (1,2) and microarray (3,4) technologies have 
already proved their value. 

However, there are several noteworthy outcomes that re- 
sult from this. Many comments are voiced that scientists 
armed with these technologies are now commonly faced 
with data overload. Thus, in some instances, rather than 
facilitating the decision process, the accumulation of more 
complex data points, many with unknown consequences, 
can seem to hinder the process. Also, most drug compa- 
nies have simultaneously incorporated very similar compo- 
nents of the new technology platforms, the consequence 
being that it is becoming difficult yet again to determine 
where a clear competitive advantage will arise. Finally, in 
recent years, largely as a result of the accessibility of the 
technologies, there has been an overwhelming emphasis 
placed on genomic and mRNA data rather than on protein 
analysis. It is important to remember that proteins dictate 
biological phenotype - whether it is normal or diseased - 
and are the direct targets for most drugs. 

Figure 1. Steps involved in analysing a biological sample by proteomics. MCI, molecular cluster index. 

Proteomics: new technology for 
the analysis of proteins 

It is now timely to recognize that complementary technol- 
ogy in the form of high-throughput analysis of the total 
protein repertoire of chosen biological samples, namely 
proteomics, is poised to add a new and important dimen- 
sion to drug discovery. In a similar fashion to genomics, 
which aims to profile every gene expressed in a cell, pro- 
teomics seeks to profile every protein that is expressed (5-7). 
However, there is added information, since proteomics can 
also be used to identify the post-translational modifications 
of proteins (8), which can have profound effects on bio- 
logical function, and their cellular localization. Importantly, 
proteomics is a technology that integrates the significant 
advances in two-dimensional (2D) electrophoretic separa- 
tion of proteins, mass spectrometry and bioinformatics. 
With these advances it is now possible to consistently de- 
rive proteomes that are highly reproducible and suitable 
for interrogation using advanced bioinformatic tools. 

There are many variations whereby different laboratories 
operate proteomics. For the purpose of this review, the 



process used at Oxford GlycoSciences (OGS), which uses 
an industrial-scale operation that is integral to its drug dis- 
covery work, will be described. The individual steps of 
this process, where up to 1000 2D gels can be run and 
analysed per week, are summarized in Fig. 1. The incom- 
ing samples are bar coded and all information relevant to 
the sample is logged into a Laboratory Information 
Management System (LIMS) database. There can be a wide 
range in the type of samples processed, as applicable to 
individual steps in the drug discovery pipeline, and these 
will be mentioned later. The samples are separated accord- 
ing to their charge (pI) in the first dimension, using iso- 
electric focusing, followed by size (MW) using SDS-PAGE 
in the second dimension. Many modifications have been 
made to these steps to improve handling, throughput and 
reproducibility. The separated proteins are then stained 
with fluorescent dyes which are significantly more sensi- 
tive in detection than standard silver methods and have a 
broader dynamic range. The image of the displayed pro- 
teins obtained is referred to as the proteome, and is digi- 
tally scanned into databases using proprietary software 
called ROSETTA™. The images are subsequently curated, 
which begins with the removal of any artefacts, cropping 
and the placement of pI/MW landmarks. Replicate 
images are then aligned and matched to one 
another to generate a synthetic composite image. This is 
an important step, as the proteome is a dynamic situation, 
and it captures the biological variation that occurs, such 
that even orphan proteins are still incorporated into the 
analysis. 

By means of illustration, Fig. 1 shows the process 
whereby proteomes are generated from normal and dis- 
ease samples and how differentially expressed proteins are 
identified. The potential of this type of analysis is tremen- 
dous. For example, from a mammalian cell sample, in ex- 
cess of 2000 proteins can typically be resolved within the 
proteome. The quality of this is shown in Fig. 2, which 
shows representative proteomes from three diverse bio- 
logical sources: human serum, the pathogenic fungus 
Candida albicans and the human hepatoma cell line 
Huh7. 

Use of proteomics to identify 
disease specific proteins 

In most cases, the drug discovery process is initiated by 
the identification of a novel candidate target - almost al- 
ways a protein - that is believed to be instrumental in the 
disease process. To date, there is a variety of means 
whereby drug targets have been forthcoming. These in- 
clude molecular, cellular and genomic approaches, mostly 
centred upon DNA and mRNA analysis. The gene in ques- 
tion is isolated, and expression and characterization of its 
coded protein product - i.e. the drug target - is invariably 
a secondary event. 

With the proteomic approach, the starting point is at the 
other end of the 'telescope'. Here there is direct and im- 



mediate comparison of the proteomes from paired normal 
and disease materials. Examples of these pairs are: (1) pu- 
rified epithelial cell populations derived from human 
breast tumours, matched to purified normal populations of 
human breast epithelial cells, and (2) the invading patho- 
genic hyphal form of C. albicans, matched to the non- 
invading yeast form of C. albicans. When the proteome 
images from each pair are aligned, the Proteograph™ soft- 
ware is able to rapidly identify those proteins (each refer- 
enced as having a unique molecular cluster index, or MCI) 
that are either unique, or those that are differentially ex- 
pressed. Thus, the Proteograph output from this analysis is 
both qualitative and quantitative. 
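
The kind of comparison described above can be illustrated with a small sketch. It is not the Proteograph software, whose internals are not described here; it simply assumes each proteome is a mapping from MCI to a positive spot intensity, and it uses an assumed two-fold cut-off. The MCI labels, values, and function name are hypothetical.

```python
def compare_proteomes(normal, disease, fold_cutoff=2.0):
    """Compare two proteomes keyed by MCI.

    Returns (unique_to_normal, unique_to_disease, differential), where
    differential maps MCI -> disease/normal fold change for proteins
    changed by at least fold_cutoff in either direction.
    """
    # Qualitative output: proteins present in only one sample.
    unique_normal = sorted(set(normal) - set(disease))
    unique_disease = sorted(set(disease) - set(normal))
    # Quantitative output: fold change for proteins present in both.
    differential = {}
    for mci in set(normal) & set(disease):
        fold = disease[mci] / normal[mci]
        if fold >= fold_cutoff or fold <= 1 / fold_cutoff:
            differential[mci] = fold
    return unique_normal, unique_disease, differential


# Hypothetical spot intensities keyed by MCI.
normal = {"MCI-001": 100.0, "MCI-002": 50.0, "MCI-003": 80.0}
disease = {"MCI-001": 110.0, "MCI-002": 200.0, "MCI-004": 30.0}

only_normal, only_disease, changed = compare_proteomes(normal, disease)
```

The two lists of unique proteins correspond to the qualitative part of the output, and the fold-change table corresponds to the quantitative part.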

Proteograph analysis for a particular study can also be 
undertaken on any number of samples. For example, one 
might compare anything from a few to several hundred 
preparations or samples, each from a normal and disease 
counterpart, and have these analysed in a single 
Proteograph study. In this way, it is possible to assign 
strong statistical confidence to the data and in some in- 
stances to identify specific subpopulations within the input 
biological sources. This feature will become increasingly 
significant in the near future, and there is a clear synergy 
here whereby proteomics can work closely with pharma- 
cogenomic approaches to stratify patient populations and 
achieve effective targeted care for the patient. Whatever 
the source of the materials, the net output of Proteograph 
analysis is immediate identification of disease specific proteins.
This is shown in Fig. 3, which shows the results of
a Proteograph obtained by comparing untreated human
hepatoma cells with cells following exposure to a clinical
cytotoxic agent. In this instance, only the top 20 differentially
expressed MCIs are shown, but the readout would
normally extend to a defined cut-off value, typically a two-fold
or greater difference in expression levels, determined
by the user.

Figure 2. Representative proteomes obtained from (a) human serum, (b) the pathogenic fungus Candida albicans
and (c) the human hepatoma cell line Huh7.

Figure 3. Table of differential protein expression
profiles, referred to as a Rosetta Proteograph™,
between Huh7 cells with and without the cytotoxic
agent 5-FU. Foreground: Huh7 cells treated with 5-FU;
background: untreated Huh7 cells. Bars mark features that
are upregulated or downregulated in treated cells with
respect to untreated cells. Bars are quantized and do not
represent exact fold change values.

In a typical analysis involving disease and normal mammalian
material, in which each proteome would have
~2000 protein features each assigned an MCI, the Proteograph
might identify somewhere in the region of 50-300
MCIs that are unique or differentially expressed. To capitalize
rapidly on these data, OGS applies a high-throughput
mass spectrometry facility, coupled to advanced databases,
to annotate these MCIs as individual proteins. As
these are all disease specific proteins, each could represent
a novel target and/or a novel disease marker. The process
becomes even more powerful when a panel of features,
rather than individual features, is assigned. The relevance
of this is apparent when one considers that most diseases, 
if not all, are multifactorial in nature and arise from poly- 
genic changes. Rather than analysing events in isolation, 
the ability to examine hundreds or thousands of events 
simultaneously, as shown by proteomics, can offer real 
advantages. 

Identification and assignment of candidate targets 
The rapid identification and assignment of candidate tar- 
gets and markers represents a huge challenge, but this has 
been greatly facilitated by combining the recent advances 
made in proteomics and analytical mass spectrometry 9 . 
Using automated procedures it is now possible to annotate 
proteins present in femtomole quantities, which corresponds
to the low-abundance class of proteins. The process of
annotation is similarly aided by the quality and richness of 
the sequence specific databases that are currently avail- 
able, both in the public domain and in the private sector 
(e.g. those supplied by Incyte Pharmaceuticals). In this re- 
spect, the advances in proteomics have benefited consider- 
ably from the breakthroughs achieved with genomics. 

From an application perspective, cancer studies provide a 
good opportunity whereby proteomics can be instrumental 
in identifying disease specific proteins, because it is often 
feasible to obtain normal and diseased tissue from the same 
patient. For example, proteomic studies have been reported
on neuroblastomas 10, human breast proteins from
normal and tumour sources 11-13, lung tumours 14, colon
tumours 15 and bladder tumours 16. There are also proteomic
studies reported within the cardiovascular therapeutic area,
in which disease or response proteins are identified 17,18.

Genomic microarray analysis can similarly identify 
unique species or clusters of mRNAs that are disease specific.
However, in some instances, there is a clear lack of
correlation between the levels of a specific mRNA and its
corresponding protein (Ref. 19; Gygi, S.P. et al., submitted).
This has now been noted by many investigators and
reaffirms that post-transcriptional events, including protein
stability, protein modification (such as phosphorylation,
glycosylation, acylation and methylation) and cell localization,
can constitute major regulatory steps. Proteomic
analysis captures all of these steps and can therefore provide
unique and valuable information independent from,
or complementary to, genomic data.

Proteomics for target validation and signal transduction studies

The identification of disease specific proteins alone is in- 
sufficient to begin a drug screening process. It is critical to 
assign function and validation to these proteins by con- 
firming they are indeed pivotal in the disease process. 
These studies need to encompass both gain- and loss-of- 
function analyses. This would determine whether the activity 
of a candidate target (an enzyme, for example), eliminated 
by molecular/cellular techniques, could reverse a disease 
phenotype. If this happened, then the investigator would 
have increased confidence that a small-molecule inhibitor 
against the target would also have a similar effect. The 
proposal of candidate drug targets is often not a difficult 
process, but validating them is another matter. Validation 
represents a major bottleneck where the wrong decision 
can have serious consequences 20 . 

Proteomics can be used to evaluate the role of a chosen 
target protein in signal transduction cascades directly rel- 
evant to the disease. In this manner, valuable information 
is forthcoming on the signalling pathways that are per- 
turbed by a target protein and how they might be cor- 
rected by appropriate therapeutics. Techniques that are 
well established in one-dimensional protein studies to in- 
vestigate signalling pathways, such as western blotting 
and immunoprecipitation, are highly suited to proteomic 
applications. For example, the proteomes obtained can be
blotted onto membranes and probed with antibodies
against the target protein or related signalling molecules 21-23.
Because proteomics can resolve >2000 proteins
on a single gel, it is possible to derive important
information on specific isoforms (such as glycosylated or 
phosphorylated variants) of signalling molecules. This will 
result in characterization of how they are altered in the 
disease process. Western immunoblotting techniques 
using high-affinity antibodies will typically identify proteins
present at ~10 copies per cell (~1.7 fmol); this is in
contrast to the best fluorescent dyes currently available 
that are limited to imaging proteins at 1000 or more 
copies per cell. The level of sensitivity derived by these 
applications will greatly facilitate interpretation of com- 
plex signalling pathways and contribute significantly to 
validation of the target under study. 
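
As a rough check on the sensitivity figures quoted above, the arithmetic that relates copies per cell to femtomoles can be sketched as follows. The number of cells loaded is not stated in the text, so the cell count computed here is only what the quoted pairing of about 10 copies per cell with about 1.7 fmol would imply; the function names are illustrative.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def fmol_to_molecules(fmol: float) -> float:
    """Convert an amount in femtomoles to a number of molecules."""
    return fmol * 1e-15 * AVOGADRO


def implied_cell_count(fmol: float, copies_per_cell: float) -> float:
    """Number of cells needed for `copies_per_cell` to add up to `fmol`."""
    return fmol_to_molecules(fmol) / copies_per_cell


if __name__ == "__main__":
    molecules = fmol_to_molecules(1.7)
    cells = implied_cell_count(1.7, 10)
    print(f"{molecules:.2e} molecules, {cells:.2e} cells")
```

Under these assumptions, 1.7 fmol is on the order of 10^9 molecules, which at 10 copies per cell corresponds to a sample of roughly 10^8 cells.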

Immunoprecipitation studies 

Similarly, immunoprecipitation studies are another useful 
way to exploit the resolving power of proteomics 24,25. In
this instance, very large quantities of protein (e.g. several
milligrams) can be subjected to incubation with antibodies
against chosen signalling molecules. This allows high-affinity
capture of these proteins, which can subsequently be
eluted and electrophoresed on a 2D gel to provide a high- 
resolution proteome of a specific subset of proteins. 
Detection by blot analysis allows the identification of ex- 
tremely small amounts of defined signalling molecules. 
Again, the different isoforms of even very low abundance 
proteins can be seen, and, very importantly, the technique 
allows the investigator to identify multiprotein complexes 
or other proteins that co-precipitate with the target protein. 
These coassociating proteins frequently represent sig- 
nalling partners for the target protein, and their identifi- 
cation by mass spectrometry can lead to invaluable infor- 
mation on the signalling processes involved. 

The depth of signal transduction analysis offered by 
proteomics, and the utility for target validation studies, 
can be extended even further by applying cell fraction- 
ation studies 26-28 . By purifying subcellular fractions, such 
as membrane, nuclear, organelle and cytosolic, it is possi- 
ble to assign a localization to proteins of interest and to 
follow their trafficking in a cell. Enrichment of these frac- 
tions will also allow much higher representation of low 
abundance proteins on the proteome. Their detection by 
fluorescent dyes or immunoblot techniques will lead to 
the identification of proteins in the range of 1-10 copies 
per cell, putting the sensitivity on a par with genomic 
approaches. 

These signal transduction analyses can be of additional 
value in experiments where inhibitors derived from a 
screening programme against the target are being evalu- 
ated for their potency and selectivity. The inhibitors can 
encompass small molecules, antisense nucleic acid con- 
structs, dominant-negative proteins, or neutralizing anti- 
bodies microinjected into cells. In each case, proteome 
analysis can provide unique data in support of validation 
studies for a chosen candidate drug target. 

Proteomics and drug mode-of-action studies 

Once a validated target is committed to a screening regi- 
men to identify and advance a lead molecule, it is impor- 
tant to confirm that the efficacy of the inhibitor is through 
the expected mechanism. Such mode-of-action studies are 
usually tackled by various cell biological and biochemical 
methods. Proteomics can also be usefully applied to these 
studies and this is illustrated below by describing data ob- 
tained with OGT719. This is a novel galactosyl derivative of 
the cytotoxic agent 5-fluorouracil (5-FU), which is currently 
being developed by OGS for the treatment of hepatocel- 
lular carcinoma and colorectal metastases localized 
in the liver. The premise underpinning the design and rationale
of OGT719 was to derive a 5-FU prodrug capable
of targeting, and being retained in, cells bearing the
asialoglycoprotein receptor (ASGP-r), including hepatocytes 29,
hepatoma Huh7 cells 30 and some colorectal tumour cells 31.
The growth of the human hepatoma cell line Huh7 is inhibited
by 5-FU or by OGT719. If the inhibition by
OGT719 were the result of uptake and conversion to 5-FU
as the active component, then it would be expected that
Huh7 cells would show similar proteome profiles following
exposure to either drug.

Figure 4. Features that are specifically up- or downregulated in Huh7 cells by either 5-fluorouracil (5-FU) or
OGT719: (a) elongation factor 1α2, (b) novel (three peptides by MS-MS) and (c) α-subunit of prolyl-4-hydroxylase.
Arrows indicate up- or downregulation.

To examine these possibilities, we conducted an experi- 
ment taking samples of Huh7 cells that had been treated 
with IC50 doses of either OGT719 or 5-FU. Total cell lysates
were prepared and taken through 2D electrophoresis, 
fluorescence staining, digital imaging and Proteograph 
analysis. To facilitate the interpretation of the data across 
all of the 2291 features seen on the proteomes, drug- 
induced protein changes of fivefold or greater, identified 
by the Proteograph, were analysed further. Interestingly, 
from this analysis 19 identical proteins were changed five- 
fold or more by both drugs, strongly suggesting similarities 
in the mode of action for these two compounds. 
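
The filtering step described above can be sketched in a few lines. This is a minimal illustration, not the Proteograph software: it assumes each proteome is available as a mapping from a feature identifier (MCI) to a measured abundance, and all names and example values are hypothetical. Features absent from one sample are skipped here, and the direction of change is not compared between drugs.

```python
from typing import Dict, Set


def changed_features(treated: Dict[str, float],
                     control: Dict[str, float],
                     min_fold: float = 5.0) -> Set[str]:
    """Return feature IDs whose abundance rises or falls by at least min_fold."""
    hits = set()
    for mci, ctrl in control.items():
        trt = treated.get(mci)
        if trt is None or ctrl <= 0 or trt <= 0:
            continue  # features missing from either sample need separate handling
        ratio = trt / ctrl
        if ratio >= min_fold or ratio <= 1.0 / min_fold:
            hits.add(mci)
    return hits


def co_modulated(control: Dict[str, float],
                 drug_a: Dict[str, float],
                 drug_b: Dict[str, float],
                 min_fold: float = 5.0) -> Set[str]:
    """Features changed by at least min_fold under both treatments."""
    return (changed_features(drug_a, control, min_fold)
            & changed_features(drug_b, control, min_fold))


if __name__ == "__main__":
    # Hypothetical abundances for four features.
    untreated = {"m1": 100.0, "m2": 100.0, "m3": 100.0, "m4": 100.0}
    with_5fu = {"m1": 600.0, "m2": 15.0, "m3": 120.0, "m4": 500.0}
    with_ogt719 = {"m1": 550.0, "m2": 10.0, "m3": 700.0, "m4": 110.0}
    print(sorted(co_modulated(untreated, with_5fu, with_ogt719)))
```

In this toy data set only m1 and m2 pass the fivefold threshold under both treatments, which mirrors how a shared subset of features would be picked out for follow-up.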

Thus, from very complex data involving >2000 protein 
features, using proteomics it is possible to analyse quanti- 
tatively and qualitatively each protein during its exposure 
to drugs. The biologist is now able to focus a series of fur- 
ther studies specifically on an enriched subset of proteins. 



Figure 4 shows highlighted examples of the selected areas 
of the proteome where some of these identified proteins in 
the above study are altered in response to either or both 
drugs. 

Several of the proteins identified above as being modu- 
lated similarly by 5-FU or OGT719 in Huh7 cells were sub- 
jected to tandem mass-spectrometric analysis for anno- 
tation. Some of these, such as the nuclear ribosomal 
RNA-binding protein 32 , can be placed into pyrimidine 
pathways or related cell cycle/growth biochemical path- 
ways in which 5-FU is known to act. 

To attribute further significance to the proteome mode- 
of-action studies with OGT719, another cell line, the rat 
sarcoma HSN, was used. Growth of these cells is inhibited 
by 5-FU, but they are completely refractory to OGT719; 
notably they lack the ASGP-r, which might explain this 
finding (unpublished). For our proteome studies, HSN 
cells were treated with 5-FU or OGT719 over a time course 
of one, two and four days. At each time point, cells were 
harvested and processed to derive proteomes and 
Proteographs. As before, we purposely focused on those 
proteins that increased or decreased by fivefold or more. 
In this instance, there were no proteins co-modulated by 
the two drugs. This is perhaps to be expected, given that 
the HSN cells are killed by 5-FU and yet are refractory to 
OGT719.

Clear potential

The above is just an example of how proteomics can be 
used to address the mode of action of anticancer drugs. 
The potential of this approach is clear, and one can envis- 
age situations where it will be profitable to compare the 
proteomes of cells in which the drug target has been elimi- 
nated by molecular knockout techniques, or with small- 
molecule inhibitors believed to act specifically on the same 
target. In addition to using proteomics to examine the ac- 
tion of drugs, it is also possible to use this approach to 
gauge the extent of nonspecific effects that might eventu- 
ally lead to toxicity. For instance, in the example used 
above with HSN cells treated with OGT719, although cell 
growth was not affected, the levels of several specific pro- 
teins were changed. Further investigation of these proteins 
and the signalling pathways in which they are involved 
could be illuminating in predicting the likelihood or otherwise
of long-term toxicity.

Use of proteomics in formal drug 
toxicology studies 

A drug discovery programme at the stage where leads 
have been identified and mode-of-action studies are ad- 
vanced, will proceed to investigate the pharmacokinetic 
and toxicology profile of those agents. These two param- 
eters are of major importance in the drug discovery 
process, and many agents that have looked highly promis- 
ing from in vitro studies have subsequently failed because 
of insurmountable pharmacokinetic and/or toxicity prob- 
lems in vivo. Whereas the pharmacokinetic properties of a 
molecule can now be characterized quickly and accu- 
rately, toxicity studies are typically much longer and more 
demanding in their interpretation. 

The ability to achieve fast and accurate predictions of 
toxicity within an in vivo setting would represent a big 
step forward in accelerating any drug discovery pro- 
gramme. Toxicity from a drug can be manifested in any 
organ. However, because the liver and kidney are the 
major sites in the body responsible for metabolism and 
elimination of most drugs, it is informative to examine 
these particular organs in detail to provide early indi- 
cations about events that might result in toxicity. 

The basis for most xenobiotic metabolizing activity is to 
increase the hydrophilicity of the compound and so facilitate
its removal from the body. Most drugs are metabolized
in the liver via the cytochrome P450 family of enzymes,
which are known to comprise a total of ~200
different members 33,34, encompassing a wide array of
overlapping specificities for different substrates. In addition
to clearance, they also play a major role in metabolism
that can lead to the production and removal of toxic
species, and in some instances it is possible to correlate 
the ability or failure to remove such a toxin with a specific 
P450 or subgroup. 

Unique P450 profiles 

Each individual person will have a slightly different P450 
profile, largely from polymorphisms and changes in ex- 
pression levels, although other genetic and environmental 
factors aside from P450 also need to be taken into consid- 
eration. A significant amount of research is currently 
being directed towards this field - known as pharmacogenomics -
with the aim of predicting how a patient will respond
to a drug, as determined by their genetic make-up 35-37.
The marked variation of individuals in their ability
to clear a compound can be one of the key factors in de- 
ciding the overall pharmacokinetic profile of a drug. Not 
only will this have a bearing on the likelihood of a patient 
responding to a treatment, but it will also be a factor in 
determining the possibility of their experiencing an ad- 
verse effect. 

Many pharmaceutical companies are already employing 
genomic approaches, involving P450 measurements, as a 
key step in their assessment of the toxicological profile of 
a candidate drug and therefore of its suitability, or other- 
wise, to be considered for human clinical trials. There are 
limits to this approach, however. Whereas the P450 mRNA 
profiling can predict with some accuracy the likely meta- 
bolic fate of a drug, it will not provide information on 
whether the metabolites would subsequently lead to tox- 
icity. Besides the patient-to-patient differences in steady- 
state levels of the P450s, there are also characteristic induc- 
tion responses of these enzymes to some drugs. Moreover, 
as there can be some doubt over the correlation of mRNA 
levels and the corresponding protein levels, there is scope 
for misinterpretation of the results and hence real advan- 
tages to be gained from a proteome approach. In both in- 
stances, the ability to examine entire proteome profiles, in- 
cluding the P450 proteins, will be a significant advantage 
in understanding and predicting the metabolism and 
toxicological outcome of drugs. 

In addition to direct organ and tissue studies, the serum, 
which collects the majority of toxicity markers released 
from susceptible organs and tissues throughout the entire 
body, can be utilized. Serum is rich in nuclease activity 
and, as pharmacogenomics is not suited to deal with these 
samples, valuable markers of toxicity could go undetected. 
However, by using proteomics for these types of analyses, 
serum markers (and clusters thereof) are now accessible
for evaluation as indicators of toxicity.

Pharmacoproteomics

Proteomics can thus be used to add a new sphere of 
analysis to the study of toxicity at the protein level, and in 
the era of '-omics' there is a case to be made to adopt the 
term 'Pharmacoproteomics™'. Animals can be dosed with 
increasing levels of an experimental drug over time, and 
serum samples can be drawn for consecutive proteome 
analyses. Using this procedure, it should be possible to 
identify individual markers, or clusters thereof, that are 
dose related and correlate with the emergence and severity 
of toxicity. Markers might appear in the serum at a defined 
drug dose and time that are predictive of early toxicity 
within certain organs and if allowed to continue will have 
damaging consequences. These serum markers could sub- 
sequently be used to predict the response of each individ- 
ual and allow tailoring of therapy whereby optimal effi- 
cacy is achieved without adverse side effects being 
apparent. This application can obviously extend to track- 
ing toxicity of drugs in clinical trials where serum can be 
readily drawn and analysed. Surrogate markers for drug ef- 
ficacy could also be detected by this procedure and could 
facilitate the challenge of identifying patient classes who 
will respond favourably to a drug and at what dosage. 

Conclusions 

By contrast to the agents administered to patients in clini- 
cal wards, the process of drug discovery is not a prescrip- 
tive series of steps. The risks are high and there are long 
timelines to be endured before it is known whether a can- 
didate drug will succeed or fail. At each step of the drug 
discovery process there is often scope for flexibility in in- 
terpretation, which over many steps is cumulative. The 
pharmaceutical companies most likely to succeed in this 
environment are those that are able to make informed 
accurate decisions within an accelerated process. 

The genomics revolution has impacted very positively 
upon these issues and now has a powerful new partner in 
proteomics. The ability to undertake global analysis of proteins
from a very wide diversity of biological systems and
to interrogate these in a high-throughput, systematic man- 
ner will add a significant new dimension to drug discov- 
ery. Each step of the process from target discovery to clini- 
cal trials is accessible to proteomics, often providing 
unique sets of data. Using the combination of genomics 
and proteomics, scientists can now see every dimension of 
their biological focus, from genes, mRNA, proteins and 
their subcellular localization. This will greatly assist our 
understanding of the fundamental mechanistic basis of 
human disease and allow new, improved and speedier
drug discovery strategies to be implemented.

Wilson et al. (2000) compared a large number of protein domains
to one another in a pair-wise fashion with respect to
similarities in sequence, structure, and function. Using a hybrid
functional classification scheme merging the ENZYME
and FlyBase systems (Gelbart et al. 1997; Bairoch 2000), they
found that precise function is not conserved below 30-40%
identity, although the broad functional class is usually preserved
for sequence identities as low as 20-25%, given that
the sequences have the same fold. Their survey also reinforced
the previously established general exponential relationship
between structural and sequence similarity (Chothia and Lesk
1986).

Other Work on Establishing Relationships between 
Sequence, Structure, and Function 

Several other groups have studied the relationship between 
sequence, structure, and function in detail, attempting to determine
the extent to which functional transference between
matching proteins is feasible (Shah and Hunter 1997; Martin
et al. 1998; Thornton et al. 1999, 2000; Zhang et al. 1999; 
Shapiro and Harris 2000; Todd et al. 2001). Orengo et al. 
(1999) analyzed protein families in the CATH database and 
concluded that > 96% of the folds in the PDB are associated 
with a single homologous family. By investigating enzymatic 
folds they also found that more than 95% of homologous 
families show either single or closely related functions. 



The ultimate goal of the genome projects is to determine the 
structure and function of all the newly identified gene products.
Fundamentally, this will be carried out via annotation
transfer, transferring the structural and functional annotation
from an experimentally characterized protein (as in a model
organism such as Escherichia coli) to a predicted protein in a
newly sequenced genome that shares similarity in sequence.
The degree of annotation transferred will depend on the de- 
gree of sequence similarity. This process is shown schemati- 
cally in Figure 1. In this paper, we aim to address this major 
question in bioinformatics, specifically focusing on multi- 
domain proteins, as they make up the bulk of the proteome in 
eukaryotic organisms (Gerstein 1998). 

Our work is a direct outgrowth of two previous analyses
of ours that concentrated on single-domain proteins. In an
earlier paper, we found that the different structural classes of
the scop classification system have different propensities to
carry out certain types of function (Hegyi and Gerstein 1999).
In particular, while the alpha/beta folds were disproportion- 
ately associated with enzymes and all-alpha and small folds 
with non-enzymes, the alpha + beta structures had an equal 
tendency for both enzymatic and non-enzymatic functions.

Annotation Transfer for Genomics

Figure 1 Schematic illustrating annotation transfer. This figure illustrates the process of annotation transfer for a group of hypothetical TIM barrel
proteins. The leftmost panel represents sequence comparisons between idealized barrel domains from a number of organisms. The next panel
shows analogous results for structural comparison, and the panel after that, functional comparison. The last panel represents sequence
comparisons between idealized multi-domain proteins that match over a single domain, the subject of much of this paper.



Pawlowski et al. (2000) studied the relationship between sequence
and functional similarity in the twilight zone of 10%-15%
sequence similarity and found a clear correlation between
the two, with functional similarity based on the E.C.
classification of enzymes.

Russell et al. (1997) analyzed binding sites in proteins
with similar 3D structures and estimated that 90% of new
remote homologs have common binding sites and similar
functions. Eisenstein et al. (2000) evaluated the first results
from the structural genomics projects and found that in many
instances the protein structure itself offers an important clue
to its biological function. Stawiski et al. (2000) found that
function could be predicted rather successfully for just the
proteases. Devos and Valencia (2000) presented a critical view
of function transference between similar sequences, highlighting
the limitations of this process due to errors in databases
and the inherent complexity of the relationship between
protein sequence-structure and function that does not
allow "simplistic interpretations." They also found that binding
sites are the least conserved features between related proteins
while the catalytic activity of enzymes is the most conserved
one.

Multi-Domain Proteins with Divergent Functions: 
How Common? 

Most of these previous investigations focused on single- 
domain proteins or did not distinguish between single- and 
multi-domain ones. It is not clear how the multi-domain pro- 
teins with various functions behave with respect to functional 
conservation; namely, whether they are more or less con- 
served than their single-domain counterparts. In particular, as 
shown in Figure 1, if one multi-domain protein shares a single
domain fold with another one, it is not clear the degree to 
which the functional conservation of these proteins is con- 
strained by the shared part, and to what degree it is influenced 
by other domains that are not shared. 

Specific groups of proteins that have the same combination
of structural domains but dramatically different functions
illustrate this situation. One example is the combination
of the SH3-domain (scop superfamily identifier 2.24.2) and
the P-loop containing NTP hydrolase (3.29.1). While in
higher organisms this combination is associated with presynaptic
and tumor suppressor functions (SWISS-PROT names
SP02_HUMAN and DLG1_DROME, respectively), in the lower
Dictyostelium it was found in myosin (MYSP_DICDI). Another
example is the combination of the FAD/NAD(P)-binding
superfamily and FAD-linked reductases C-terminal
superfamily (3.4.1 and 4.12.1 superfamilies, respectively). In
one group of proteins they appear in enzymes of the oxidoreductase
group (e.g. OXDA_CAEEL or PHHY_PSEAE), while
in another they are found in a dissociation inhibitor (e.g.
GDIA_HUMAN). It should be noted that the proteins are not
covered completely by the structural matches, so it is quite 
possible that the rest of them contain totally different do- 
mains that are responsible for the dramatically different func- 
tions. However, do these two examples show a rather rare or 
a more frequent phenomenon? How often do multi-domain 
proteins, sharing the same structural domain composition, 
differ in their functions? 

In this paper, we attempt to provide a comprehensive 
answer to this question. This is particularly timely given that 
most of the unknown proteins in eukaryotic genomes are 
multi-domain. We use the same approach as in our previous
analyses, comparing the sequences of the structural domains 
in scop to those of SWISS-PROT using blastp. We focus on 
the functional divergence of single and multi-domain pro- 
teins, extending previous investigations of single-domain 
proteins. Also, in comparison to previous work, we focus 
more on non-enzymatic functions and scop structural super- 
families, instead of folds. 

RESULTS 

Our Approach to Functional 
and Structural Assignment 

We used the BLASTP program (version 2.0) (Altschul et al.
1997) to identify the scop 1.39 (Murzin et al. 1995) structural
domains in SWISS-PROT (version 37) (Bairoch and Apweiler
2000) with e = 10⁻⁴. We removed the hypothetical and fragment
proteins. This resulted in two sets of proteins.



Of the single-domain matches, only those that were almost
completely covered with a match to a single structural domain
were selected. (The maximum number of uncovered
residues was set at 70, with an additional condition that a
maximum of 40 residues on the N-terminal end and 30 residues
on the C-terminus were allowed to be uncovered.) These
criteria resulted in 1818 single-domain proteins being selected
from SWISS-PROT.
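
The coverage criteria for single-domain proteins can be expressed as a small predicate. The sketch below assumes a single contiguous match described by 1-based, inclusive start and end positions on the protein; the function and parameter names are illustrative and not taken from the authors' pipeline.

```python
def is_single_domain_match(protein_length: int,
                           match_start: int,
                           match_end: int,
                           max_uncovered: int = 70,
                           max_n_term: int = 40,
                           max_c_term: int = 30) -> bool:
    """Check whether one structural-domain match covers almost the whole protein.

    Positions are 1-based and inclusive.
    """
    n_uncovered = match_start - 1             # residues before the match
    c_uncovered = protein_length - match_end  # residues after the match
    return (n_uncovered <= max_n_term
            and c_uncovered <= max_c_term
            and n_uncovered + c_uncovered <= max_uncovered)


if __name__ == "__main__":
    print(is_single_domain_match(300, 21, 280))  # 20 residues free at each end
    print(is_single_domain_match(300, 51, 300))  # 50 residues free at the N-terminus
```

For a single contiguous match, the 40 and 30 residue limits already imply the overall limit of 70; the total is checked separately only to keep the three stated conditions visible.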

Multi-Domain

We selected 4763 multi-domain proteins from SWISS-PROT.
All of these matched (in different locations) at least two domains
of known structure belonging to different scop superfamilies
(see schematic in Figure 1). We also selected a subset
of these proteins that have almost their entire length covered
by matches with structural domains (allowing again a maximum
of 70 uncovered residues). This selection resulted in
2829 proteins being selected from SWISS-PROT. (In all cases,
duplicate matches were removed, i.e., a protein at a certain
location matches only one structural domain.)
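
The multi-domain selection can be sketched along the same lines. The text states only that duplicate matches were removed so that each location matches one structural domain; how competing matches were ranked is not described, so keeping the match with the lowest e-value among overlapping ones is an assumption made here for illustration, as are all names in the code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Match:
    start: int        # 1-based, inclusive
    end: int          # 1-based, inclusive
    superfamily: str  # scop superfamily identifier, e.g. "3.29.1"
    evalue: float


def overlaps(a: Match, b: Match) -> bool:
    return a.start <= b.end and b.start <= a.end


def remove_duplicates(matches: List[Match]) -> List[Match]:
    """Keep one match per location (assumed rule: lowest e-value wins)."""
    kept: List[Match] = []
    for m in sorted(matches, key=lambda m: m.evalue):
        if not any(overlaps(m, k) for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m.start)


def is_multi_domain(matches: List[Match]) -> bool:
    """At least two matched domains from different superfamilies."""
    return len({m.superfamily for m in remove_duplicates(matches)}) >= 2


def uncovered_residues(length: int, matches: List[Match]) -> int:
    # Kept matches do not overlap, so their lengths can simply be summed.
    covered = sum(m.end - m.start + 1 for m in remove_duplicates(matches))
    return length - covered


def is_fully_covered_multi_domain(length: int, matches: List[Match],
                                  max_uncovered: int = 70) -> bool:
    return (is_multi_domain(matches)
            and uncovered_residues(length, matches) <= max_uncovered)
```

A protein passing `is_multi_domain` would belong to the larger set described above, and one that also passes `is_fully_covered_multi_domain` to the almost fully covered subset.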

We set out to compare these two sets of proteins for
functional divergence. As previously, we divided functions
into enzyme and non-enzyme (Hegyi and Gerstein 1999). Enzymatic
functions were classified by the EC system (Bairoch
2000). Comparisons of enzymatic functions were treated the
same way as in our earlier analyses, that is, if they differ in the
first three components of their respective EC numbers, they
were considered different. This implied that our analysis dealt
with a total of 112 enzymatic functions. Non-enzymatic functions
were classified into 508 different categories based on a
simple thesaurus we assembled of synonymous keywords
drawn from SWISS-PROT description lines. In addition, we
created 49 categories for functions that have an enzymatic
component but which are not part of the EC system. This gave
us a total of 669 functions (112 + 508 + 49). (The list of all the
functional categories is described further in Table 2 below,
and also can be found on the Web at
http://bioinfo.mbb.yale.edu/partslist/func or http://partslist.org/func.)
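The rule for comparing enzymatic functions can be illustrated with a short sketch. The helper names are hypothetical, and EC numbers are assumed to be given as dot-separated strings such as "3.2.1.4"; only the first three components are compared, as described above.

```python
def ec_class(ec_number: str) -> tuple:
    """Return the first three components of an EC number string."""
    return tuple(ec_number.strip().split(".")[:3])


def same_enzymatic_function(ec_a: str, ec_b: str) -> bool:
    """Two enzymes count as sharing a function if their EC numbers
    agree in the first three components (the fourth is ignored)."""
    return ec_class(ec_a) == ec_class(ec_b)


# Sanity check on the category count stated in the text.
TOTAL_FUNCTIONS = 112 + 508 + 49  # enzymatic + non-enzymatic + non-EC enzymatic
```

Under this rule an endoglucanase (3.2.1.4) and a glucoamylase (3.2.1.3) fall into the same functional category, while an endoglucanase and a urease (3.5.1.5) do not.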

Overall Distribution of the Matches 

Figure 2 shows the most commonly observed multi-domain 
combinations in a set of recently sequenced genomes. The 
occurrences of further combinations are available from the 
Web site. Clearly, the distribution is very skewed, with certain 
combinations, such as 3.29-2.32 and 2.29-4.61, tending to
predominate.

Figure 3 shows the overall distribution of the single-domain
and multi-domain matches in the different structural
classes. The distribution of matches between enzymes and
non-enzymes in multi-domain proteins largely agrees with
that in the single-domain proteins. The multi-domain
matches follow the overall tendency of the alpha/beta folds to
be associated with enzymes to a larger extent and the all-alpha
and small folds with non-enzymes. However, the values
for the multi-domain matches are generally less extreme than
for single-domains; for example, the 10-fold difference between
single-domain alpha/beta enzymes and non-enzymes
decreases to about twofold in multi-domain proteins. Another
significant difference is the reduction in the number of multi- 
domain non-enzymes in the all-beta and alpha + beta struc- 
Figure 2 Distribution of multi-domain combinations amongst the
genomes. The figure shows the occurrence of multi-domain fold
combinations in a number of genomes, indicating its great variability.
Each row indicates a particular combination of scop fold pairs (using
scop 1.39), where a fold pair is defined as two distinct folds occurring
in tandem in a protein. Each column represents a different genome,
using the four-letter codes in the PartsList system (Qian et al. 2001):
Aaeo, Aquifex aeolicus; Aful, Archaeoglobus fulgidus; Bbur, Borrelia
burgdorferi; Bsub, Bacillus subtilis; Cele, Caenorhabditis elegans; Cpne,
Chlamydia pneumoniae; Ctra, Chlamydia trachomatis; Ecol, Escherichia
coli; Hinf, Haemophilus influenzae Rd; Hpyl, Helicobacter pylori; Mthe,
Methanobacterium thermoautotrophicum; Mjan, Methanococcus
jannaschii; Mtub, Mycobacterium tuberculosis; Mgen, Mycoplasma
genitalium; Mpne, Mycoplasma pneumoniae; Phor, Pyrococcus horikoshii;
Rpro, Rickettsia prowazekii; Scer, Saccharomyces cerevisiae; Syne,
Synechocystis sp.; Tpal, Treponema pallidum. The numbers in each
intersection cell indicate the number of times the fold pairs occur in a
genome. Only the 20 most common fold pair combinations are
shown here; the remainder are shown on the Web site
(http://partslist.org/func). If a cell is greater than 6, it is shaded black;
between 3 and 6, gray; and below 3, white. The blank spaces show
instances in which one of the pairs does not occur in the organism at
all (indicated by a value of -1 in the data table on the Web site). The
fold assignments are done in a fashion consistent with those in
PartsList and associated systems (Gerstein 1997; Lin et al. 2000;
Drawid et al. 2001; Harrison et al. 2001; Qian et al. 2001).



tural classes compared to the single-domain matches. Alto- 
gether, there are more enzymes than non-enzymes among the 
multi-domain proteins (2805 enzymes vs. 1958 non-enzymes) 
whereas for single-domain proteins, the opposite is true (850 
enzymes vs. 968 non-enzymes). 

Table 1 summarizes the distribution of superfamilies and 
superfamily combinations among the major functional 
classes, i.e. whether they have only enzymatic, only non- 
enzymatic or both enzymatic and non-enzymatic functional- 
ity. Altogether, 215 superfamilies were found in single-domain 
proteins and 310 in multi-domain ones. As 70 superfamilies 
were found in both, altogether 455 distinct structural super- 
families matched a SWISS-PROT protein with our required 
coverage criteria (described above). Similarly, we apportioned 
the 281 superfamily combinations observed in multi-domain 
proteins amongst different broad functional categories. 

In single-domain proteins there are about as many su- 
perfamilies with exclusively enzymatic functionality as there 
are those with exclusively non-enzymatic functions (82 vs. 
78). In contrast, in multi-domain proteins this ratio increases 
Figure 3 Distribution of proteins amongst broad structural and
functional classes; the distribution of the matches among the seven
structural and two functional classes in single- and multi-domain
proteins. The single-domain and multi-domain matches each total
100%, independently of each other. The horizontal axis indicates the
seven scop classes, which are (from 1 to 7): all-alpha, all-beta,
alpha/beta, alpha + beta, multi-domain, membrane, and small protein.

to almost threefold (135 vs. 56). This agrees with the notion
that most enzymes are multi-domain. Another difference between
single and multi-domain proteins appears in the ratio
of superfamilies with a single function compared to multifunctional
ones. As is apparent from Table 1, about a quarter
of the superfamilies matched single-domain proteins with
different functions (55 of 215), whereas in the multi-domain
proteins, this ratio increased to more than a third (119 of 310).
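The proportions quoted from Table 1 can be checked with a few lines of arithmetic. The sketch below uses only the "Total" row of Table 1 (single-function and multiple-function counts); the dictionary layout and function name are illustrative choices, not part of the original analysis.

```python
# (single-function count, multiple-function count) from the Total row of Table 1
table1_totals = {
    "single-domain superfamilies": (160, 55),
    "multi-domain superfamilies": (191, 119),
    "multi-domain sfam combinations": (221, 60),
}


def multifunctional_share(single: int, multiple: int) -> float:
    """Fraction of entries associated with more than one function."""
    return multiple / (single + multiple)


for name, (single, multiple) in table1_totals.items():
    share = multifunctional_share(single, multiple)
    print(f"{name}: {multiple} of {single + multiple} = {share:.1%}")
# single-domain superfamilies: 55 of 215 = 25.6%
# multi-domain superfamilies: 119 of 310 = 38.4%
# multi-domain sfam combinations: 60 of 281 = 21.4%
```

These values correspond to the "about a quarter" and "more than a third" quoted above, and to the roughly one-fifth of multi-domain combinations with multiple functions.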

Single-Domain Proteins 

Table 2 lists the two functionally most diverse structural superfamilies
in single-domain proteins with some representative
functions. The most diverse superfamily, the 3.38.1
Thioredoxin-like, has 11 different functions associated with
it, most of them with an oxidoreductase mechanism. For instance,
THIO_BPT4 is a small, disulphide-containing thioredoxin
that serves as a general disulphide oxidoreductase,
while TOX21BK0K1A is almost twice as long (199 aa) and
serves as a thiol-specific antioxidant that acts against
sulfur-containing radicals. Another interesting example of functional
diversity is provided by the Scorpion toxin-like superfamily
(7.3.6). [...] known to be 2000 times sweeter than sucrose, the
other members of the superfamily are associated with host-defense
mechanisms. In insects the superfamily possesses
antifungal activity (DMYC_DROME) or acts as a toxin
(SCX53lTOy). In plants it acts as an
antifungal (AF2B_SINAL) or as an inhibitor of insect alpha-amylases
(SIA1_SORBI). It appears that many single-domain
proteins are toxins or allergens, or are related in other ways to
a host-defense mechanism.

Based on the data we can also determine the probability
of two single-domain proteins that match domains in the
same superfamily category carrying out the same function,
using Bayes' theorem:

P(F|S) = P(F)P(S|F) / [P(F)P(S|F) + P(~F)P(S|~F)]   (1)

where S is the probability that two proteins share the same
superfamily, F is the probability that two proteins have the
same function, and ~F is the probability that two proteins do
not have the same function. Rearranging and simplifying the
equation we get:

P(F|S) = 1/(1 + N(S,~F)/N(S,F))   (2)

where N is the number of times that the two events in the
parentheses occur together in our database of 1818 single-domain
proteins. This results in

P(F|S) = 1/(1 + N(S,~F)/N(S,F)) = 68%

That is, the probability that two single-domain proteins that
have the same superfamily structure have the same function
(whether enzymatic or not) is about 2/3.
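The step from Bayes' theorem to the count-based expression can be written out as follows. This is a reconstruction under the definitions given in the text, with \bar{F} denoting the event that two proteins do not have the same function, and with joint probabilities estimated by pair counts N over the same set of protein pairs (an assumption made explicit here).

```latex
\begin{align*}
P(F \mid S)
  &= \frac{P(F)\,P(S \mid F)}{P(F)\,P(S \mid F) + P(\bar{F})\,P(S \mid \bar{F})} \\
  &= \frac{P(S, F)}{P(S, F) + P(S, \bar{F})} \\
  &= \frac{1}{1 + P(S, \bar{F}) / P(S, F)}
   \;\approx\; \frac{1}{1 + N(S, \bar{F}) / N(S, F)}
\end{align*}
```

The second line uses P(F)P(S|F) = P(S,F); the last step replaces each joint probability by its pair count, since the common normalizing factor cancels in the ratio.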

Multi-Domain Proteins

Table 3 lists the combinations of superfamilies that have been 
associated with the greatest number of different functions in
multi-domain proteins, with representative entries in SWISS- 
PROT. The combination with the greatest number of different 
functions is that of 1.95.1 and 7.33.1. Although it has twice as 
many different functions as the most diverse superfamily in 



Table 1. Functional Distribution of Single-domain, Multi-domain Superfamilies, and
Multi-domain Combinations

                  Single-domain          Multi-domain           Multi-domain sfam
                  superfamilies          superfamilies          combinations
                  Single     Multiple    Single     Multiple    Single     Multiple
                  function   function    function   function    function   function

Enzymatic         82         11          135        42          151        16
Nonenzymatic      78         23          56         30          70         27
Both functions    -          15          -          47          -          17
Total             160        55          191        119         221        60

The basic functional distribution of the superfamilies in single- and multi-domain proteins and the
functional distribution of multi-domain combinations are shown. The first row lists the number of
scop superfamilies that were associated only with enzymatic function in each category. The second
row lists the number associated with only nonenzymatic functions, and the third row indicates the
number of superfamilies that were associated with both types of function. Altogether, we characterized
160 + 55 = 215 single-domain and 191 + 119 = 310 multi-domain superfamilies, 70 of
which overlapped in the two categories.
The most versatile superfamilies in single-domain proteins as determined from their functional description in
SWISS-PROT, with some representatives. The keyword combinations in the fourth column were based either on the first three
components of their EC numbers (for enzymes) or derived automatically by comparing the DE description line of
SWISS-PROT entries to a list of synonymous keywords at http://bioinfo.mbb.yale.edu/partslist/func. A keyword number
starting with a 0 indicates an enzyme that does not have an assigned EC number in its description in SWISS-PROT.



the single-domain proteins (22 vs. 11, respectively), careful 
examination reveals that all the proteins in this category are 
DNA-binding and most of them act as hormone receptors. 

The second entry listed in the table is the combination of 
the 3.4.1 and 4.48.1 superfamilies associated with the FAD/ 
NAD(P)-linked reductases. It is an all-enzymatic combination 
and always carries out an oxido-reductase function. All the 
proteins in this category are completely covered by matches 
with these two superfamilies. The 1.78.1-2.1.1 hemocyanin- 
immunoglobulin combination seems also to be fairly con- 
served; although the proteins in this category are called by 
eight different names, most of them turn out to be extracel- 
lular larval storage proteins, except for the copper-containing 
oxygen carrier hemocyanin itself (HCY_PALVU).

Following the same logic, we can also determine the 
probability that two proteins that have the same superfamily 
combination share the same function, viz: 

P(F|S) = 1/(1 + 32242/134230) = 81% 

This means that we have significantly greater certainty in de- 
termining the function of a multi-domain protein with a par- 
ticular superfamily combination than that of a single-domain 
protein containing a particular superfamily. We also deter- 
mined a similar probability for those proteins that have an 



almost complete coverage with exactly the same type and 
number of superfamilies, following each other in the same 
order. The probability that the functions are the same in this 
case was 91%, a considerably higher value than above. How- 
ever, if two multi-domain proteins share only a single super- 
family, the probability that they share the same function 
drops to only 35%! This greater functional certainty from 
sharing a combination of superfamilies rather than just one is 
also reflected in Table 1. While one-fourth of the superfamilies in
single-domain proteins and one-third of the singly matching superfamilies
in multi-domain proteins have multiple functions,
only about one-fifth of the multi-domain combinations possess
multiple functions (60 of 281). It is also clear from the
data that domains in larger proteins often lose their original 
function and no longer have an autonomous function. 
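As a worked check of the multi-domain figure, the count-based form of the probability can be evaluated directly. The sketch below assumes that in the expression 1/(1 + 32242/134230) the numerator counts pairs with the same superfamily combination but different functions and the denominator counts pairs with the same combination and the same function; the function and parameter names are illustrative.

```python
def prob_same_function(n_diff_function: int, n_same_function: int) -> float:
    """P(F|S) = 1 / (1 + N(S,~F) / N(S,F)) for pairs sharing structure S."""
    return 1.0 / (1.0 + n_diff_function / n_same_function)


# Pair counts quoted in the text for shared superfamily combinations.
p_combination = prob_same_function(32242, 134230)
print(f"{p_combination:.0%}")  # prints "81%"
```

The same function applied to the corresponding counts for single-domain proteins, or for multi-domain proteins sharing only one superfamily, would give the other probabilities quoted in the text.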

Seventy Common Superfamilies and Their 
Functions Compared in Single-Domain 
and Multi-Domain Proteins 

As mentioned above, of the 455 superfamilies in our analysis, 
only 70 occur in both single- and multi-domain proteins. 
Even more surprising is the small number of structural super- 
families (14) that have the same function in both single- and 
multi-domain proteins. These are listed in Table 4; 12 of them
have enzymatic function, supporting the notion that en- 
zymes are more conserved during evolution than non- 
enzymes. The two non-enzymatic superfamilies are the 4.29.1 
ribosomal superfamily and the 5.4.1 superfamily in penicillin- 
binding proteins. 

Table 5 presents several examples of the converse situation,
shared superfamilies that have different functions in
single and multi-domain proteins. Comparing parts A and B
of the table highlights the fact that although both superfamilies
in a multi-domain protein are often present in single-domain
form as well, the functions in the different settings
are only vaguely related. One example is the combination of
the lipocalin superfamily (2.45.1) with that of the BPTI-like or
Kunitz inhibitor (7.7.1), which in higher organisms forms a
complex protein called alpha-1-microglobulin (AMBP_RAT).
Another interesting example is the combination of the 2.5.1
Cupredoxin (occurring in the single-domain blue-copper protein,
SOXE_SULAC) and the 6.5.1 Membrane all-alpha
(single-domain representative: BACT_HALVA, a sensory rho- 
Single-domain proteins                                       Multi-domain proteins
ID           SWISS-PROT function                             ID           SWISS-PROT function

GUNY_ERWCH   Endoglucanase (3.2.1.4)                         AMYG_NEUCR   Glucoamylase Precursor (3.2.1.3)
URE2_YERPS   Urease Beta (3.5.1.5)                           URE1_HELPY   Urease Alpha Subunit (3.5.1.5)
NADE_MYCPN   NAD(+) Synthetase (6.3.5.1)                     GUAA_YEAST   GMP Synthase (6.3.5.2)
PTP2_NPVOP   Protein-Tyrosine Phosphatase 2 (3.1.3.48)       PTNB_RAT     Protein-Tyrosine Phosphatase (3.1.3.48)
TRPB_VIBPA   Tryptophan Synthase (4.2.1.20)                  TRP_YEAST    Tryptophan Synthase (4.2.1.20)
FKB1_METJA   Peptidylprolyl Cis-Trans Isomerase (5.2.1.8)    FKB7_WHEAT   70 Kd Peptidylprolyl Isomerase (5.2.1.8)
LYCV_BPP2    Lysozyme (3.2.1.17)                             CHIX_PEA     Endochitinase Precursor (3.2.1.14)
RS5_ACYKS    30S Ribosomal Protein S5                        RS5_TREPA    30S Ribosomal Protein S5
SNPA_STRCS   Extracellular Neutral Protease (3.4.24.-)       BMPH_STRPU   Collagenase 3 Precursor (3.4.24.-)
URE3_YERPS   Urease Gamma (3.5.1.5)                          URE1_HELPY   Urease Alpha Subunit (3.5.1.5)
KANU_STAAU   Kanamycin Nucleotidyltransferase (2.7.7.-)      DPOB_XENLA   DNA Polymerase Beta (2.7.7.7)
AMPH_ECOLI   Penicillin-binding Protein Amph                 PBPX_STRPN   Penicillin-binding Protein 2x Pbp2x

Sfam (partial list): 2.66, 3.17.2, 3.67.1, 4.19.1, 5.10.1
Table 5. Examples of Superfamilies Present in Both Single and Multi-Domain Proteins

Sfam [...]
  Funct #: [...]; 183#; E1.17.4; 192#
  SWISS-PROT ID: [...]
  SWISS-PROT function: Ferritin-like Protein 2; Nigerythrin; Ribonucleotide Reductase [...]; Ner-like Protein Homolog

Sfam [...]
  Funct #: [...]
  SWISS-PROT ID: [...]
  SWISS-PROT function: Histone H[...]

Sfam [...]
  Funct #: [...]
  SWISS-PROT ID: [...]
  SWISS-PROT function: Farnesyltransferase Beta [...]

Sfam [...]
  Funct #: 226#; 228#412#; 229#; E5.3.99; 230#421#
  SWISS-PROT ID: [...]_MOUSE; PGHD_HUMAN
  SWISS-PROT function: Epididymal Retinoic Acid Binding Protein; [...] Acid-binding Protein Homolog 3; Neutrophil Gelatinase [...]; Nitrophorin 4 Precursor; Prostaglandin-H2 [...]; Vomeronasal Secretory Protein [...]

Sfam [...]
  Funct #: 231#; 232#427#
  SWISS-PROT ID: MPA3_AMBEL; SOXE_SULAC
  SWISS-PROT function: Pollen Allergen Amb a 3 (Amb a III); Sulfocyanin (Blue Copper Protein)

Sfam 3.14.2
  Funct #: 373#
  SWISS-PROT ID: RRF1_DESVH
  SWISS-PROT function: Rrf1 Protein

Sfam [...]
  Funct #: E6.3.4; E2.7.4; 0259#; E2.7.1
  SWISS-PROT ID: PURA_CAEEL; [...]
  SWISS-PROT function: Adenylosuccinate Synthetase (6.3.4.4); Thymidylate Kinase (2.7.4.9); Guanylate Kinase Homolog; Thymidine Kinase (2.7.1.21)

Sfam 3.47.1
  Funct #: 275#; 276#
  SWISS-PROT ID: MBL_BACSU; MREB_BACSU
  SWISS-PROT function: MBL Protein; Rod Shape-determining Protein Mreb

Sfam 3.48.1
  Funct #: E3.1.3
  SWISS-PROT ID: PPA5_YEAST
  SWISS-PROT function: Repressible Acid Phosphatase (3.1.3.2)

Sfam 3.81.1
  Funct #: 0281#; 282#
  SWISS-PROT ID: AMIC_PSEAE; LUXP_VIBHA
  SWISS-PROT function: Aliphatic Amidase Expression-Regulator; LUXP Protein Precursor

Sfam 4.103.1
  Funct #: E2.4.2
  SWISS-PROT ID: TOX1_BORPE
  SWISS-PROT function: Pertussis Toxin Su 1 (2.4.2.-)

Sfam 4.105.1
  Funct #: 291#
  SWISS-PROT ID: [...]_POLMI
  SWISS-PROT function: Lectin-Polyandrocarpa Misakiensis

Sfam 4.11.5
  Funct #: 295#
  SWISS-PROT ID: TERP_PSESP
  SWISS-PROT function: Terpredoxin

Sfam 4.19.1
  Funct #: E5.2.1
  SWISS-PROT ID: FKB1_METJA
  SWISS-PROT function: Pept-Prolyl Cis-Trans Isomerase (5.2.1.8)

Sfam 6.5.1
  Funct #: E3.6.1; 540#325#
  SWISS-PROT ID: ATPL_VIBAL; BACT_HALVA
  SWISS-PROT function: ATP Synthase (3.6.1.34) (Lipid-binding); Sensory Rhodopsin II (Sr-II)

Sfam 7.35.4
  Funct #: E1.9.3; 345#
  SWISS-PROT ID: COXB_RAT; DESR_DESBI
  SWISS-PROT function: Cytochrome C Oxidase (1.9.3.1) (VIa*); Desulforedoxin (Dx)

Sfam 7.7.1
  Funct #: 349#
  SWISS-PROT ID: TAP_ORNMO
  SWISS-PROT function: Tick Anticoagulant Peptide



dopsin) superfamilies into a component of the respiratory 
chain, cytochrome C oxidase II (COOX_ZOOAN). All these 
examples demonstrate the evolutionary advantage of a do- 
main fusion event, which creates a function that is more com- 
plex than either of the components. 

Multifunctionality vs. Sequence Similarity 

Previously, we presented a variety of graphs that show how 
the probability that two domains would share the same func- 
tion varied with respect to sequence similarity (Hegyi and 



Gerstein 1999; Wilson et al. 2000). Figure 4 shows a similar 
graph with the calculations extended to multi-domain pro- 
teins. The figure shows that the functional divergence of a 
single domain in multi-domain proteins dramatically in- 
creases, more than twofold, compared to the single-domain 
ones. This reinforces our findings above, based only on super- 
family content, that the certainty with which we can predict 
the function of a protein based on its sequence similarity with 
a domain in another multi-domain protein, is considerably 
less than for a comparable single-domain situation. 
Table 5B. Multi-Domain Proteins

Sfam Comb. [...]
  Funct #: [...]04#
  SWISS-PROT ID: RUBY_METJA
  SWISS-PROT function: Putative Rubrerythrin

Sfam Comb. 1.32.[...]/3.81.1
  Funct #: [...]11#; [...]582#11#
  SWISS-PROT ID: PURR_HAEIN; DEGA_BACSU; SCRR_STRMU
  SWISS-PROT function: Purine Nucleotide Synthesis Repressor; Degradation Activator; Transcription Regulatory Protein Rega

Sfam Comb. 1.4.3/3.14.2
  Funct #: 10#; 13#; 190#; 366#
  SWISS-PROT ID: SKN7_YEAST; VIRG_AGRT5; RGX3_MYCTU; PFER_PSEAE; [...]
  SWISS-PROT function: Transcription Factor Skn7 (Pos9 Protein); Virg Regulatory Protein; Sensory Transduction Protein REGX3; Transcriptional Activator Protein Pfer; Petr Protein

Sfam Comb. 2.45.1/7.7.1
  Funct #: 203#1[...]3#
  SWISS-PROT ID: [...]
  SWISS-PROT function: Alpha-1-microglobulin [...]

Sfam Comb. 2.5.1/6.5.1
  Funct #: [...]
  SWISS-PROT ID: COX2_ZOOAN
  SWISS-PROT function: Cytochrome [...]

Sfam Comb. 3.[...]9.1/3.48.1
  Funct #: E2.7.1
  SWISS-PROT ID: F26[...]
  SWISS-PROT function: 6-Phosphofructo-2-kinase [...] (2.7.1.105)

Sfam Comb. 3.47.1/5.17.1
  Funct #: 1#; 1#83#
  SWISS-PROT ID: YEDO_YEAST; CR73_MAIZE
  SWISS-PROT function: Heat Shock Protein 70 Homolog YEL030W; Ig-Binding Protein



DISCUSSION 

Here we built on our previous studies on the relationship 
between protein structure and function to develop new re- 
sults related to multi-domain proteins. Throughout the paper, 
we focused on superfamilies instead of folds, as the members 
of a superfamily are presumably of common evolutionary ori- 
gin (Murzin et al. 1995). 

We found that the 4763 multi-domain and 1818 single- 
domain proteins that met our selection criteria have about 
the same distribution of structural classes, with more enzy- 
matic functions associated with the alpha/beta structural 
classes and more non-enzymatic ones with the all-alpha and 
small classes. We identified more than three times as many
multi-domain proteins that were enzymes as single-domain
ones (2805 and 850, respectively) and, conversely,
about twice as many multi-domain proteins as single-domain 
ones that were non-enzymes (1958 vs. 968). 

We focused on the functional divergence of the two 
groups and found that about a quarter of the superfamilies in 
single-domain proteins are associated with multiple func- 
tions, whereas only about a fifth of the multi-domain super- 
family combinations are. Therefore, we can conclude that a 
combination of specific superfamilies results in a more spe- 
cific functional assignment for a particular protein. However, 
about one-third of the superfamilies in the multi-domain pro- 
teins were associated with multiple functions, underlining 
the lesser autonomy of a domain function in multi-domain
proteins.

This latter finding was also supported by the difference 
in functional divergences between the two groups of proteins 
based on particular sequence similarities between the do- 
mains and SWISS-PROT proteins. As is shown in Figure 4, the 
average functional divergence of a single domain is much 
larger (more than twofold) in multi-domain proteins than in 
single-domain ones. 

We also found that only 70 of a total of 455 superfamilies 
are shared between the multi-domain and single-domain pro- 
teins and only a small fraction (14) share their functions. This 



was rather surprising to us, and should be taken into consid- 
eration in functional characterization and annotation of new 
gene products. When the functions were related in single- and 
multi-domain proteins, we could observe an increasing func- 
tional complexity with the appearance of large multi-domain 
proteins. 

Altogether, with the recent sequencing of the human 
genome and the genomes of other model organisms, we hope 




Figure 4 Divergence in function with respect to sequence similarity.
Relative number of matching domains with multiple functions, as
a function of e-value threshold (horizontal axis: -log(e-value)).
Diamonds represent single-domain proteins, squares multi-domain
ones (matching just for a single domain). The first value on the
X-axis starts at 4 (corresponding to an e-value = 10^-4).
that this work can contribute to the successful annotation of
the individual gene products, and will help to avoid some 
pitfalls associated with the functional characterization of 
large, complex proteins. 

The publication costs of this article were defrayed in part 
by payment of page charges. This article must therefore be 
hereby marked "advertisement" in accordance with 18 USC 
section 1734 solely to indicate this fact. 